I have been looking for a good discussion on how to calculate the average coverage of exome sequencing data after alignment. I found that DepthOfCoverage is a good tool for this; however, I am unable to understand what all of its output means.
My aim is to calculate the average coverage (X) or summary statistics of the depth of coverage for 7 exome sequencing samples after alignment.
To do that, I followed these steps:
Create a file called input_bam.list that lists the BAM files with their full paths,
e.g.
/home/test/Desktop/bam1.bam
/home/test/Desktop/bam2.bam
/home/test/Desktop/bam3.bam
We have BED files with the regions and chromosomes, with the headers
chr start stop name
I also created a refGene file using the UCSC Table Browser
http://genome.ucsc.edu/cgi-bin/hgTables?command=start for the regions in the BED file,
and sorted the file using the following command:
sort -nk3 -nk5 hgTables.txt > genes_refgene_sorted.txt
After executing the following command:
java -jar ./../GATK/GenomeAnalysisTK-3.5/GenomeAnalysisTK.jar -T DepthOfCoverage -I input_bam.list -o file_base_name_withbedfile --outputFormat table -R humangenome/ucsc/ucsc.hg19.fasta -L Regions.bed -geneList genes_refgene_sorted.txt -dt NONE
I get this **error**:
MESSAGE: Input file must have contiguous chromosomes. Saw feature chr22:19510547-19512860 followed later by chr18:19993564-19997878 and then chr22:22113947-22221970, for input source: Desktop/genes_refgene_sorted.txt
Please suggest whether I should sort the file with a different command.
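One possible reading of the error: with `sort -nk3 -nk5`, the chromosome column holds text such as chr22, which a numeric sort does not order by name, so the rows end up ordered by start position alone and the chromosomes interleave. The positions in the error message (19510547, 19993564, 22113947) are ascending, which fits that reading. A minimal Python sketch of grouping the rows by chromosome follows. It assumes column 3 is the chromosome and column 5 the start, as the sort command implies, and that the wanted order is the contig order of the reference (for example from a .fai index). The function names, row values and contig list are hypothetical.

```python
def sort_gene_rows(rows, contig_order, chrom_col=2, start_col=4):
    """Sort tab-split rows by reference contig order, then by start position."""
    rank = {name: i for i, name in enumerate(contig_order)}
    # Rows on contigs that are missing from the reference list are dropped.
    kept = [r for r in rows if r[chrom_col] in rank]
    return sorted(kept, key=lambda r: (rank[r[chrom_col]], int(r[start_col])))


def is_contiguous(rows, chrom_col=2):
    """Return True if each chromosome appears in one uninterrupted run."""
    seen = set()
    previous = None
    for r in rows:
        chrom = r[chrom_col]
        if chrom != previous:
            if chrom in seen:
                return False
            seen.add(chrom)
            previous = chrom
    return True


# Hypothetical rows (bin, name, chrom, strand, txStart, txEnd); only the
# chromosomes and positions are taken from the error message.
rows = [
    ["734", "NM_A", "chr22", "+", "19510547", "19512860"],
    ["737", "NM_B", "chr18", "-", "19993564", "19997878"],
    ["753", "NM_C", "chr22", "+", "22113947", "22221970"],
]
contig_order = ["chr1", "chr18", "chr22"]  # hypothetical subset of the reference contigs
sorted_rows = sort_gene_rows(rows, contig_order)
print(is_contiguous(rows), is_contiguous(sorted_rows))
```

A header line in the downloaded table, if present, would need to be set aside before sorting and written back at the top.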
If I run the command without the refGene file:
java -jar ./../GATK/GenomeAnalysisTK-3.5/GenomeAnalysisTK.jar -T DepthOfCoverage -I input_bam.list -o file_base_name_withbedfile --outputFormat table -R humangenome/ucsc/ucsc.hg19.fasta -L Regions.bed
I get the following output files:
file_base_name_withbedfile.sample_cumulative_coverage_counts
file_base_name_withbedfile.sample_cumulative_coverage_proportions
file_base_name_withbedfile.sample_interval_statistics
file_base_name_withbedfile.sample_interval_summary
file_base_name_withbedfile.sample_statistics
file_base_name_withbedfile.sample_summary
I don't understand which output file is best suited to answering my question about depth.
In the last output file, file_base_name_withbedfile.sample_summary,
the output looks like this:
sample_id total mean granular_third_quartile granular_median granular_first_quartile %_bases_above_15
test 1162396121 1775.69 500 500 343 91.7
Total 1162396121 1775.69 N/A N/A N/A
I don't understand what to make of it, or why there are N/A values.
In the file file_base_name_withbedfile.sample_interval_summary,
the output looks like the following. I don't understand what to make of it, apart from the total coverage over the 3 BAM files for each location. As I understand it, that means there is a total of 6638920 reads (or nucleotides) across the 3 BAM files (for example) at that particular location. What do the test_granular_Q values mean? Which column should I use to calculate the average coverage, so that I can state that after alignment the exomes have X coverage?
Target total_coverage average_coverage test_total_cvg test_mean_cvg test_granular_Q1 test_granular_median test_granular_Q3 test_%_above_15
chr1:1716462-1719040 6638920 2574.22 6638920 2574.22 >500 >500 >500 100.0
chr1:1719110-1720851 4192130 2406.50 4192130 2406.50 >500 >500 >500 91.8
chr1:1721604-1722165 1011309 1799.48 1011309 1799.48 >500 >500 >500 99.3
chr1:1724574-1725729 3912540 3384.55 3912540 3384.55 >500 >500 >500 99.9
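As a check on how these columns relate, the sketch below (Python, with hypothetical function names) divides total_coverage by the interval length, assuming each Target is written as chr:start-end with inclusive coordinates. For the four rows above this reproduces the average_coverage column, so a length-weighted mean over all intervals is one candidate for a single average depth figure. It is an illustration under those assumptions, not a statement of how DepthOfCoverage computes its summary.

```python
def interval_length(target):
    """Length in bases of a 'chr:start-end' target, assuming inclusive coordinates."""
    _, span = target.split(":")
    start, end = span.split("-")
    return int(end) - int(start) + 1


def mean_coverage(rows):
    """Length-weighted mean depth over (target, total_coverage) pairs."""
    total = sum(cov for _, cov in rows)
    bases = sum(interval_length(t) for t, _ in rows)
    return total / bases


# Target and total_coverage columns copied from the four rows shown above.
rows = [
    ("chr1:1716462-1719040", 6638920),
    ("chr1:1719110-1720851", 4192130),
    ("chr1:1721604-1722165", 1011309),
    ("chr1:1724574-1725729", 3912540),
]
for target, cov in rows:
    # Per-interval mean depth: total coverage divided by interval length.
    print(target, round(cov / interval_length(target), 2))
print("overall", round(mean_coverage(rows), 2))
```

A target written as a single position (no start-end range) would need separate handling in `interval_length`.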
If this is a redundant question, could anyone direct me to the relevant discussion so that I can understand the output?
Thanks in advance.