Channel: Recent Discussions — GATK-Forum

What does the output of DepthOfCoverage mean?

I have been looking for a good discussion of how to calculate the average coverage of exome sequencing data after alignment. DepthOfCoverage appears to be a good tool for this; however, I am unable to understand what all of its output means.

My aim is to calculate the average coverage (x), or summary statistics of the depth of coverage, for 7 exome sequencing samples after alignment.

So for that I followed the steps:

  1. Create a list file, input_bam.list, containing the full path to each BAM file,
    e.g.
    /home/test/Desktop/bam1.bam
    /home/test/Desktop/bam2.bam
    /home/test/Desktop/bam3.bam

  2. We have BED files with the chromosome and region of each target,
    with the following header:
    chr start stop name

  3. I also created a refGene file using the UCSC Table Browser,
    http://genome.ucsc.edu/cgi-bin/hgTables?command=start, restricted to the regions in the BED file,

and sorted the file using the following command:
sort -nk3 -nk5 hgTables.txt > genes_refgene_sorted.txt

  4. I then ran the following command:

    java -jar ./../GATK/GenomeAnalysisTK-3.5/GenomeAnalysisTK.jar -T DepthOfCoverage -I input_bam.list -o file_base_name_withbedfile --outputFormat table -R humangenome/ucsc/ucsc.hg19.fasta -L Regions.bed -geneList genes_refgene_sorted.txt -dt NONE

**Error:**

MESSAGE: Input file must have contiguous chromosomes. Saw feature chr22:19510547-19512860 followed later by chr18:19993564-19997878 and then chr22:22113947-22221970, for input source: Desktop/genes_refgene_sorted.txt

Please suggest whether I should sort the file with a different command.
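A hedged note on the sort, as a sketch and not a confirmed fix: `-nk3` requests a numeric comparison of field 3, and a value such as `chr22` does not parse as a number, so every row likely ties on that key and the file ends up ordered by field 5 alone. That would interleave chromosomes exactly as in the error above. Sorting field 3 as text and field 5 as a number keeps each chromosome in one block. The three rows below are made up for illustration (hypothetical gene names, coordinates borrowed from the error message) and assume a tab-separated table with the chromosome in column 3 and the start in column 5. GATK may additionally expect the contigs in the same order as the reference, which this sketch does not handle.

```shell
# Hypothetical three-row stand-in for hgTables.txt (tab-separated).
# Assumed columns: bin, name, chrom, strand, txStart, txEnd.
workdir=$(mktemp -d)
printf '1\tGENE_A\tchr22\t+\t19510547\t19512860\n'  > "$workdir/hgTables.txt"
printf '1\tGENE_B\tchr18\t+\t19993564\t19997878\n' >> "$workdir/hgTables.txt"
printf '1\tGENE_C\tchr22\t+\t22113947\t22221970\n' >> "$workdir/hgTables.txt"

TAB=$(printf '\t')

# Field 3 (chromosome) is compared as text, then field 5 (start) as a number.
# LC_ALL=C makes the text ordering byte-wise and reproducible.
LC_ALL=C sort -t "$TAB" -k3,3 -k5,5n "$workdir/hgTables.txt" \
    > "$workdir/genes_refgene_sorted.txt"

# Chromosome column after sorting, joined on one line for a quick check.
sorted_chroms=$(cut -f3 "$workdir/genes_refgene_sorted.txt" | paste -s -d ' ' -)
echo "$sorted_chroms"
```

If the chromosomes printed on that line never reappear after a different one, the file is contiguous in the sense the error message asks for.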

If I run the command without the refGene file (-geneList):

java -jar ./../GATK/GenomeAnalysisTK-3.5/GenomeAnalysisTK.jar -T DepthOfCoverage -I input_bam.list -o file_base_name_withbedfile --outputFormat table -R humangenome/ucsc/ucsc.hg19.fasta -L Regions.bed

I get the following output files:

file_base_name_withbedfile.sample_cumulative_coverage_counts
file_base_name_withbedfile.sample_cumulative_coverage_proportions
file_base_name_withbedfile.sample_interval_statistics
file_base_name_withbedfile.sample_interval_summary
file_base_name_withbedfile.sample_statistics
file_base_name_withbedfile.sample_summary

I don't understand which output file is the best one to answer my question about depth.

In the last output file, file_base_name_withbedfile.sample_summary,
the output looks like this:
sample_id  total       mean     granular_third_quartile  granular_median  granular_first_quartile  %_bases_above_15
test       1162396121  1775.69  500                      500              343                      91.7
Total      1162396121  1775.69  N/A                      N/A              N/A

I don't understand what to make of it, or why there are N/A values.
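A minimal sketch of how the per-sample mean could be pulled out of this file, assuming the `mean` column is the average depth per base over the intervals passed with -L (an interpretation, not something confirmed in this post). The snippet recreates the rows above in a temporary file and prints the sample name and mean, skipping the aggregate Total row.

```shell
# Recreate the sample_summary rows shown above in a temporary file.
summary=$(mktemp)
cat > "$summary" <<'EOF'
sample_id total mean granular_third_quartile granular_median granular_first_quartile %_bases_above_15
test 1162396121 1775.69 500 500 343 91.7
Total 1162396121 1775.69 N/A N/A N/A
EOF

# Skip the header and the aggregate "Total" row; print sample_id and mean.
# awk's default splitting handles tabs or spaces, as long as sample names
# contain no whitespace.
per_sample_mean=$(awk 'NR > 1 && $1 != "Total" { print $1, $3 }' "$summary")
echo "$per_sample_mean"
```

With several samples in the BAM list, the same command would print one line per sample.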

In the file file_base_name_withbedfile.sample_interval_summary, the output looks like the following. Apart from the total coverage over the 3 BAM files for each location, I don't understand what to make of it. Does it mean that there are 6638920 reads (or nucleotides) in total across the 3 BAM files, for example, at that particular location? What do the test_granular_Q values mean? Which column should I use to compute the average coverage, so that I can state that the exomes have x coverage after alignment?

Target                total_coverage  average_coverage  test_total_cvg  test_mean_cvg  test_granular_Q1  test_granular_median  test_granular_Q3  test_%_above_15
chr1:1716462-1719040  6638920         2574.22           6638920         2574.22        >500              >500                  >500              100.0
chr1:1719110-1720851  4192130         2406.50           4192130         2406.50        >500              >500                  >500              91.8
chr1:1721604-1722165  1011309         1799.48           1011309         1799.48        >500              >500                  >500              99.3
chr1:1724574-1725729  3912540         3384.55           3912540         3384.55        >500              >500                  >500              99.9
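The rows above are consistent with average_coverage being total_coverage divided by the inclusive interval length (for the first target, 6638920 / 2579 is about 2574.22), so total_coverage appears to count aligned bases and not reads. Under that assumption, a length-weighted mean over all targets could be sketched as follows; the temporary file simply repeats the four rows shown above.

```shell
# Recreate the four sample_interval_summary rows shown above.
intervals=$(mktemp)
cat > "$intervals" <<'EOF'
Target total_coverage average_coverage test_total_cvg test_mean_cvg test_granular_Q1 test_granular_median test_granular_Q3 test_%_above_15
chr1:1716462-1719040 6638920 2574.22 6638920 2574.22 >500 >500 >500 100.0
chr1:1719110-1720851 4192130 2406.50 4192130 2406.50 >500 >500 >500 91.8
chr1:1721604-1722165 1011309 1799.48 1011309 1799.48 >500 >500 >500 99.3
chr1:1724574-1725729 3912540 3384.55 3912540 3384.55 >500 >500 >500 99.9
EOF

# Sum total_coverage and interval lengths, then divide once at the end.
# Assumes contig names contain no ':' or '-' (true for chr1..chr22, chrX, chrY).
mean_cov=$(awk 'NR > 1 {
    n = split($1, loc, /[-:]/)            # loc[2] = start, loc[3] = end
    end = (n >= 3) ? loc[3] : loc[2]      # single-base target has no end
    bases += end - loc[2] + 1             # inclusive interval length
    depth += $2                           # total_coverage for this target
} END { if (bases > 0) printf "%.2f", depth / bases }' "$intervals")
echo "$mean_cov"
```

Weighting by interval length avoids letting short targets count as much as long ones, which a plain average of the average_coverage column would do.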

If this question has already been answered elsewhere, could anyone point me to the relevant discussion so that I can understand the output?

Thanks in advance.

