Channel: Recent Discussions — GATK-Forum

A USER ERROR has occurred: 'CalculateTargetCoverage' is not a valid command.


I'm using GATK 4.0.4.0 and getting this error: A USER ERROR has occurred: 'CalculateTargetCoverage' is not a valid command. Is that command not compatible with this version, which I have been told on this forum is the best recent version to use?

To clarify, I'm trying to run the following WDL. My main goal is to get the tangent normalization output files for use in a downstream task (AllelicCNV). If you have suggestions for which tasks I might update with newer tool versions in order to generate these tangent-normalized files, I would love to hear how to improve this:

workflow GATK4SomaticCnvToolchainCaptureForSamples {
    String sample_name
    File ref_fasta
    File ref_fasta_index
    File ref_fasta_dict
    File input_bam
    File input_bam_idx

    call gatk4CNVproportionalCoverageForCapture  {
        input: sampleName=sample_name,
        refFasta=ref_fasta,
        refFastaIndex=ref_fasta_index,
        refFastaDict=ref_fasta_dict,
        inputBam=input_bam
    }

    call gatk4CNVtangentNormalizationForCapture {
        input:sampleName=sample_name,
        pCovFile=gatk4CNVproportionalCoverageForCapture.pcov
    }

    call performSegmentation {
        input:sampleName=sample_name,
        tangentNormalizedFile=gatk4CNVtangentNormalizationForCapture.tangentNormalized
    }

    call callSegments {
        input:tangentNormalizedFile=gatk4CNVtangentNormalizationForCapture.tangentNormalized,
        segmentFile=performSegmentation.segFile,
        sampleName=sample_name
    }

    call plotSegmentedCopyRatio {
        input:tangentNormalizedFile=gatk4CNVtangentNormalizationForCapture.tangentNormalized,
        segmentFile=performSegmentation.segFile,
        refFastaDict=ref_fasta_dict,
        sampleName=sample_name,
        preTangentNormalizedFile=gatk4CNVtangentNormalizationForCapture.preTangentNormalized
    }
}

task gatk4CNVproportionalCoverageForCapture {
    File refFasta
    File refFastaIndex
    File refFastaDict
    File inputBam
    String sampleName
    Int memoryGb
    Int diskSpaceGb

    command <<<
        java -jar /gatk/gatk.jar CalculateTargetCoverage \
        --output ${sampleName}.pcov \
        --groupBy SAMPLE \
        --transform PCOV \
        --targetInformationColumns FULL \
        --input ${inputBam} \
        --reference ${refFasta} \
        --cohortName "<ALL>"
    >>>

    output {
        File pcov = "${sampleName}.pcov"
    }

    runtime {
        docker: "broadinstitute/gatk:4.0.4.0"
        memory: "${memoryGb} GB"
        cpu: "1"
        disks: "local-disk ${diskSpaceGb} HDD"
    }

}

task gatk4CNVtangentNormalizationForCapture {
    File pCovFile
    File ponFile
    String sampleName
    Int memoryGb
    Int diskSpaceGb

    command <<<
        java -Xmx4g -jar /gatk/gatk.jar NormalizeSomaticReadCounts \
        --input ${pCovFile} \
        --panelOfNormals ${ponFile} \
        --tangentNormalized ${sampleName}.tn.tsv \
        --factorNormalizedOutput ${sampleName}.fnt.tsv \
        --betaHatsOutput ${sampleName}.beta_hats.tsv \
        --preTangentNormalized ${sampleName}.pre_tn.tsv
    >>>

    output {
        File tangentNormalized = "${sampleName}.tn.tsv"
        File factorNormalized = "${sampleName}.fnt.tsv"
        File betaHats = "${sampleName}.beta_hats.tsv"
        File preTangentNormalized = "${sampleName}.pre_tn.tsv"
    }

    runtime {
        docker: "broadinstitute/gatk:4.0.4.0"
        memory: "${memoryGb} GB"
        cpu: "1"
        disks: "local-disk ${diskSpaceGb} HDD"
    }
}

task performSegmentation {
    File tangentNormalizedFile
    String sampleName
    Float alpha
    Float eta
    Float trim
    Float undoPrune
    Int nperm
    Int minWidth
    Int kmax
    Int nmin
    Int undoSD
    Boolean log2Input
    Int memoryGb
    Int diskSpaceGb

    command <<<
        java -Xmx4g -jar /gatk/gatk.jar PerformSegmentation \
        --tangentNormalized ${tangentNormalizedFile} \
        --output ${sampleName}.seg \
        --alpha ${alpha} \
        --nperm ${nperm} \
        --pmethod HYBRID \
        --minWidth ${minWidth} \
        --kmax ${kmax} \
        --nmin ${nmin} \
        --eta ${eta} \
        --trim  ${trim} \
        --undoSplits NONE \
        --undoPrune ${undoPrune} \
        --undoSD ${undoSD} \
        --log2Input ${log2Input}
    >>>

    output {
        File segFile = "${sampleName}.seg"
    }

    runtime {
        docker: "broadinstitute/gatk:4.0.4.0"
        memory: "${memoryGb} GB"
        cpu: "1"
        disks: "local-disk ${diskSpaceGb} HDD"
    }
}

task callSegments {
    File tangentNormalizedFile
    File segmentFile
    String sampleName
    Int memoryGb
    Int diskSpaceGb

    command <<<
        java -jar /gatk/gatk.jar CallSegments \
        --tangentNormalized ${tangentNormalizedFile} \
        --segments ${segmentFile} \
        --output ${sampleName}.called
    >>>


    output {
        File calledSegFile = "${sampleName}.called"
    }

    runtime {
        docker: "broadinstitute/gatk:4.0.4.0"
        memory: "${memoryGb} GB"
        cpu: "1"
        disks: "local-disk ${diskSpaceGb} HDD"
    }
}

task plotSegmentedCopyRatio {
    File tangentNormalizedFile
    File segmentFile
    File refFastaDict
    String sampleName
    File preTangentNormalizedFile
    Boolean log2Input
    Int memoryGb
    Int diskSpaceGb

    command <<<
        mkdir ${sampleName}

        java -jar /gatk/gatk.jar PlotSegmentedCopyRatio \
        --tangentNormalized ${tangentNormalizedFile} \
        --segments ${segmentFile} \
        --output ${sampleName} \
        --outputPrefix ${sampleName} \
        --preTangentNormalized ${preTangentNormalizedFile} \
        --sequenceDictionaryFile ${refFastaDict} \
        --log2Input ${log2Input}
    >>>

    output {
        File before_after_plot = "${sampleName}/${sampleName}_Before_After.png"
        File before_after_CR_lim_4_plot = "${sampleName}/${sampleName}_Before_After_CR_Lim_4.png"
        File FullGenome_plot = "${sampleName}/${sampleName}_FullGenome.png"
    }

    runtime {
        docker: "broadinstitute/gatk:4.0.4.0"
        memory: "${memoryGb} GB"
        cpu: "1"
        disks: "local-disk ${diskSpaceGb} HDD"
    }
}
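For context: the beta CNV tools invoked above (CalculateTargetCoverage, NormalizeSomaticReadCounts, PerformSegmentation, CallSegments, PlotSegmentedCopyRatio) appear to have been superseded in the GATK 4.0.x releases by the somatic CNV toolchain described in the tutorial at the end of this digest. A minimal sketch of the replacement coverage-and-denoising steps, assuming GATK 4.0.4.0 and hypothetical file names:

# Bin the target intervals (hypothetical inputs: targets.interval_list, ref.fasta).
gatk PreprocessIntervals -R ref.fasta -L targets.interval_list \
    --bin-length 0 --interval-merging-rule OVERLAPPING_ONLY \
    -O targets.preprocessed.interval_list

# Collect per-interval coverage counts for the case sample.
gatk CollectReadCounts -I sample.bam -L targets.preprocessed.interval_list \
    --interval-merging-rule OVERLAPPING_ONLY -O sample.counts.hdf5

# Denoise against a panel of normals built with CreateReadCountPanelOfNormals;
# the denoised copy ratios play the role of the old tangent-normalized output.
gatk DenoiseReadCounts -I sample.counts.hdf5 --count-panel-of-normals cnv.pon.hdf5 \
    --standardized-copy-ratios sample.standardizedCR.tsv \
    --denoised-copy-ratios sample.denoisedCR.tsv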

MergeVcfs - elements of the input Iterators are not sorted according to the comparator


I'm trying to merge VCFs from 86 HaplotypeCaller jobs with GATK MergeVcfs and getting an error:

2018-07-03T13:58:06.462132033Z java.lang.IllegalStateException: The elements of the input Iterators are not sorted according to the comparator htsjdk.variant.variantcontext.VariantContextComparator

The VCFs passed validation with GATK ValidateVariants. GATK version 4.0.2.0 was used, and GSNAP was used for the alignment.
Here is the command line (most of the VCF inputs are removed to make the preview shorter and more readable):

/opt/gatk --java-options "-Xmx2048M" MergeVcfs --OUTPUT WES_human_Illumina.pe_.filtered.sorted.vc.vcf --INPUT tasks/cf6dc246-c79d-4c54-8a72-0be160a50b62/vc_GATK_HaplotypeCaller_0_s/WES_human_Illumina.pe_.filtered.sorted.vcf --INPUT tasks/cf6dc246-c79d-4c54-8a72-0be160a50b62/vc_GATK_HaplotypeCaller_1_s/WES_human_Illumina.pe_.filtered.sorted.vcf --INPUT tasks/cf6dc246-c79d-4c54-8a72-0be160a50b62/vc_GATK_HaplotypeCaller_2_s/WES_human_Illumina.pe_.filtered.sorted.vcf --REFERENCE_SEQUENCE /sbgenomics/workspaces/ac2e7439-25bf-4c9f-bd35-2a29e376d2b6/tasks/cf6dc246-c79d-4c54-8a72-0be160a50b62/vc_SBG_FASTA_Indices/Homo_sapiens_assembly38.fasta

Can you tell me why this exception is thrown and how I can mitigate it?
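For what it's worth, this exception usually indicates that at least one input VCF is not sorted consistently with the sequence dictionary. A hedged mitigation sketch (not from the original post) using Picard's SortVcf as bundled in GATK4, with hypothetical file names:

gatk SortVcf \
    --INPUT shard.vcf \
    --SEQUENCE_DICTIONARY Homo_sapiens_assembly38.dict \
    --OUTPUT shard.sorted.vcf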

Where can I get a gene list in RefSeq format?


1. About the RefSeq Format

From the NCBI RefSeq website

The Reference Sequence (RefSeq) collection aims to provide a comprehensive, integrated, non-redundant, well-annotated set of sequences, including genomic DNA, transcripts, and proteins. RefSeq is a foundation for medical, functional, and diversity studies; they provide a stable reference for genome annotation, gene identification and characterization, mutation and polymorphism analysis (especially RefSeqGene records), expression studies, and comparative analyses.

2. In the GATK

The GATK uses RefSeq in a variety of walkers, from indel calling to variant annotations. There are many file-format flavors of RefSeq; we've chosen to use the table dump available from the UCSC genome table browser.

3. Generating RefSeq files

Go to the UCSC genome table browser. There are many output options; here are the changes you'll need to make:

clade:    Mammal
genome:   Human
assembly: choose the appropriate assembly for the reference you're using
group:    Genes and Gene Prediction Tracks
track:    RefSeq Genes
table:    refGene
region:   choose the genome option

Choose a good output filename, something like geneTrack.refSeq, and click the "get output" button. You now have your initial RefSeq file, which will not be sorted and will contain non-standard contigs. To run with the GATK, contigs other than the standard 1-22, X, Y, and MT must be removed, and the file must be sorted in karyotypic order.
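A minimal filtering-and-sorting sketch, assuming the refGene table dump keeps UCSC's column layout (chrom in column 3, txStart in column 5) and that the mitochondrial contig is named chrM in the UCSC file:

for c in $(seq 1 22) X Y M; do
    # keep only the standard contig, then sort its records by transcription start
    awk -v chrom="chr$c" -F'\t' '$3 == chrom' geneTrack.refSeq | sort -t$'\t' -k5,5n
done > geneTrack.sorted.refSeq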

4. Running with the GATK

You can provide your RefSeq file to the GATK like you would for any other ROD command line argument. The line would look like the following:

-[arg]:REFSEQ /path/to/refSeq

Using the filename from above.

Warning:

The GATK automatically adjusts the start and stop position of the records from zero-based half-open intervals (UCSC standard) to one-based closed intervals.

For example:

The first 19 bases in Chromosome one:
Chr1:0-19 (UCSC system)
Chr1:1-19 (GATK)

All of the GATK output is also in this format, so if you're using other tools or scripts to process RefSeq or GATK output files, you should be aware of this difference.

problems emitting all sites with HaplotypeCaller


Hi,

I'm running GATK v4.0.5.2 and Java v1.8.0_20

I'd like to get a VCF with all callable positions, including invariants. With GATK 3, using UnifiedGenotyper, this process works great. I'm trying to run with GATK 4 and I'm only getting the variant positions. Here is my command:

gatk HaplotypeCaller -R reference.fasta -I ECOLI_renamed_header.bam -O test.vcf --output-mode EMIT_ALL_SITES

The VCF only contains one position:

##source=HaplotypeCaller

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT test

ADK1 460 . A G 81.28 . AC=2;AF=1.00;AN=2;DP=3;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=27.09;SOR=2.833 GT:AD:DP:GQ:PL 1/1:0,3:3:9:109,9,0

When I run a comparable command with GATK 3, I get 536 positions reported, with only one of them being a variant.

Am I running this correctly?
thanks,
Jason
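For reference, a hedged alternative: GATK4's HaplotypeCaller exposes per-base emission through its reference-confidence (GVCF) modes rather than through --output-mode. A minimal sketch assuming the file names above:

gatk HaplotypeCaller -R reference.fasta -I ECOLI_renamed_header.bam \
    -O test.g.vcf -ERC BP_RESOLUTION

The resulting GVCF carries a record for every covered position, including invariant ones.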

About FastaAlternateReferenceMaker in GATK 4.0


Where is FastaAlternateReferenceMaker in GATK 4.0?
I want to build a reference with mutations using FastaAlternateReferenceMaker, but I cannot find the command. Please tell me where it is.
Thank you!

How to keep Sample name in my VCF


Hi, GATK

Will GATK keep the names of multiple samples in a VCF?
No matter which sample I call variants from, the VCF shows the same sample tag, as shown below.

I also called variants with samtools, and it shows the same tag in the VCF files.

Does this mean I made some mistake before calling variants?

My workflow command lines:
Trim
nohup java -jar /opt/Trimmomatic-0.36/trimmomatic-0.36.jar PE -threads 2 -phred33 -trimlog trimbabbler_0604.txt ../Sample_LGC_KM51_450bp_10/LGC_1103S1sd252A_10_L001_R1.fastq.gz ../Sample_LGC_KM51_450bp_10/LGC_1103S1sd252A_10_L001_R2.fastq.gz babbler10_2521_R1paired.fastq.gz babbler10_2521_R1unpaired.fastq.gz babbler10_2521_R2paired.fastq.gz babbler10_2521_R2unpaired.fastq.gz ILLUMINACLIP:/opt/Trimmomatic-0.36/adapters/TruSeq3_all.fa:2:30:10 SLIDINGWINDOW:4:15 &
Mapping
nohup bwa mem -R'@RG\tID:Sample10\tLB:1103S1sd252AL1\tPL:illumina\tSM:babbler' ~/new_babbler_fa/platanus_trimmed_wNmp_kraken_gapClosed_1000.fa babbler10_2521_R1paired.fastq.gz babbler10_2521_R2paired.fastq.gz > ../babbler10_mapping/babbler10_2521.sam 2> nohup.out &
fixmate
nohup samtools fixmate -O bam babbler10_2521.sam babbler10_2521_fix.bam &
sort
nohup samtools sort -o babbler10_2521_sorted.bam -O bam -T ./lib_temp2521 babbler10_2521_fix.bam &
index
nohup samtools index babbler10_2521_sorted.bam &
Make intervals
nohup java -jar /opt/GenomeAnalysisTK-3.6/GenomeAnalysisTK.jar -T RealignerTargetCreator -nt 4 -R ~/new_babbler_fa/platanus_trimmed_wNmp_kraken_gapClosed_1000.fa -I babbler10_2521_sorted.bam -o babbler10_2521.intervals &
Indelrealigner
nohup java -jar /opt/GenomeAnalysisTK-3.6/GenomeAnalysisTK.jar -T IndelRealigner -R ~/new_babbler_fa/platanus_trimmed_wNmp_kraken_gapClosed_1000.fa -I babbler10_2521_sorted.bam -targetIntervals babbler10_2521.intervals -o babbler10_2521_IR.bam &
MarkDuplicate
nohup java -Xmx4g -jar /opt/picard-tools-2.4.0/picard.jar MarkDuplicates VALIDATION_STRINGENCY=LENIENT I=babbler10_2521_IR.bam O=babbler10_2521_MD.bam M=babbler10_2521_MD.txt REMOVE_DUPLICATES=true MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000 &
Merge
nohup samtools merge babbler10_merge.bam babbler10_2521_MD.bam babbler10_2522_MD.bam babbler10_2551_MD.bam babbler10_2552_MD.bam babbler10_2561_MD.bam babbler10_2562_MD.bam babbler10_2571_MD.bam babbler10_2572_MD.bam babbler10_2591_MD.bam babbler10_2592_MD.bam babbler10_2761_MD.bam babbler10_2762_MD.bam babbler10_2763_MD.bam babbler10_2764_MD.bam &
Samtools mpileup
nohup sh -c "samtools mpileup -C50 -q25 -uf ~/new_babbler_fa/platanus_trimmed_wNmp_kraken_gapClosed_1000.fa babbler10_merge.bam | bcftools call -vc | vcfutils.pl varFilter -d6 -D34 > babbler10_samtools.vcf" &
GATK UnifiedGenotyper
nohup java -jar /opt/GenomeAnalysisTK-3.6/GenomeAnalysisTK.jar -T UnifiedGenotyper -nct 4 -R ~/new_babbler_fa/platanus_trimmed_wNmp_kraken_gapClosed_1000.fa -I babbler10_merge.bam -o babbler10_gatk.vcf --output_mode EMIT_VARIANTS_ONLY &
GATK SelectVariants --concordance
nohup java -jar /opt/GenomeAnalysisTK-3.6/GenomeAnalysisTK.jar -T SelectVariants -nt 4 -R ~/new_babbler_fa/platanus_trimmed_wNmp_kraken_gapClosed_1000.fa -V babbler10_samtools.vcf --concordance babbler10_gatk.vcf -o babbler10_con.vcf &
GATK VariantFiltration
nohup java -jar /opt/GenomeAnalysisTK-3.6/GenomeAnalysisTK.jar -T VariantFiltration -R ~/new_babbler_fa/platanus_trimmed_wNmp_kraken_gapClosed_1000.fa -o babbler10_con_filt.vcf --filterExpression "QD < 3.0" --filterName "depth_filter" --variant babbler10_con.vcf &
GATK BaseRecalibrator
nohup java -jar /opt/GenomeAnalysisTK-3.6/GenomeAnalysisTK.jar -T BaseRecalibrator -R ~/new_babbler_fa/platanus_trimmed_wNmp_kraken_gapClosed_1000.fa -I ../babbler10_merge.bam -knownSites babbler10_con_filt.vcf -o babbler10_recal02.grp -cov CycleCovariate -cov ContextCovariate &

GATK PrintReads
nohup java -jar /opt/GenomeAnalysisTK-3.6/GenomeAnalysisTK.jar -T PrintReads -nct 4 -R ~/new_babbler_fa/platanus_trimmed_wNmp_kraken_gapClosed_1000.fa -I ../babbler10_merge.bam -BQSR babbler10_recal.grp -o babbler10_recall.bam &

Then I get a BAM file that has been recalibrated.
I perform recalibration twice, so I should get a more accurate BAM file.
I also check the BAM file's quality before and after recalibration with AnalyzeCovariates.
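A hedged observation on the commands above: the sample name that appears in the VCF column header comes from the SM field of the read group, and the bwa mem line sets SM:babbler for this sample, so every sample mapped with that template will carry the same tag. A minimal sketch with a per-sample SM value (the sample name here is hypothetical):

bwa mem -R '@RG\tID:Sample10\tLB:1103S1sd252AL1\tPL:illumina\tSM:babbler10' \
    reference.fa reads_R1.fastq.gz reads_R2.fastq.gz > sample10.sam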

How to generate a SNP database for a non-model species?


For generating a VCF file containing multiple samples, it seems that I need a database of SNPs or variants.
I want to generate a VCF file containing multiple bird samples for analyzing population genomic data with plink, so I think I can use UnifiedGenotyper to generate the VCF.
From the tool docs of UG, I understand that I may need a database VCF file to let GATK know which substitution sites exist in the bird genome, but there are no previously known SNP or INDEL sites for my study system.

I already have variants from 40 birds whose BAM files have been recalibrated twice.
So, can I generate the known sites from my samples' VCF files for calling variants on multiple samples?
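For what it's worth, the usual bootstrap for non-model organisms is to call variants without known sites, keep only high-confidence calls, and feed those back in as known sites, iterating until recalibration converges. A minimal sketch, assuming GATK 3.6 as used elsewhere in this digest and illustrative filter thresholds:

# Flag low-confidence calls with hard filters (thresholds are illustrative only).
java -jar /opt/GenomeAnalysisTK-3.6/GenomeAnalysisTK.jar -T VariantFiltration \
    -R ref.fa -V raw_calls.vcf --filterExpression "QD < 2.0 || FS > 60.0" \
    --filterName "bootstrap_filter" -o flagged.vcf

# Keep only the calls that pass, for use as -knownSites in BQSR.
java -jar /opt/GenomeAnalysisTK-3.6/GenomeAnalysisTK.jar -T SelectVariants \
    -R ref.fa -V flagged.vcf --excludeFiltered -o known_sites.vcf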

The numbers of variants in GATK SV --concordance output


What is the true meaning of SelectVariants --concordance?
I have a VCF produced by GATK UnifiedGenotyper containing 5162953 variants and a VCF generated by samtools which has 244095 variants.
After running SelectVariants --concordance on these two files, I get a VCF containing 2664909 variants.
This seems abnormal to me.
From the manual of SelectVariants, we know that we can "Select all calls made by both myCalls and theirCalls (useful to take a look at what is consistent between two callers)".
Does that mean I should get only those variants which are present in both VCF files?
If so, does that mean the VCF I generated with SelectVariants --concordance is incorrect, or does --concordance use some other algorithm to do this work?


SelectVariants Large VCF slow runtime


I am attempting to subset and filter a large (10k exome samples, 250 GB) VCF file using SelectVariants. My goal is to subset by individual samples (iterating over each sample using a custom script and issuing a separate SelectVariants command for each), selecting only heterozygous genotypes with an alt allele depth > 5 and GQ > 30, for SNPs that pass the filter. My issue is very slow runtime, which seems like it shouldn't be a problem when I only want calls from a single sample. I suspect it may be an issue with how I have set up my SelectVariants command (shown below), or it may be an issue with SelectVariants and large VCFs.

Here is the command I am using:

java -jar GATK.3.7.jar -T SelectVariants -R ref.fa -V very.large.vcf.gz -o single.sample.filtered.vcf.gz -sn sample.name -selectType SNP -select 'vc.getGenotype("sample.name").isHet()' -select 'vc.getGenotype("sample.name").getAD().1 > 5' -select 'vc.getGenotype("sample.name").getGQ() > 30' -select 'vc.isNotFiltered()'
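For reference, the four -select criteria above can be consolidated into a single JEXL expression; a hedged rewrite of the same command (intended to be behaviorally equivalent, not verified against this dataset):

java -jar GATK.3.7.jar -T SelectVariants -R ref.fa -V very.large.vcf.gz \
    -o single.sample.filtered.vcf.gz -sn sample.name -selectType SNP \
    -select 'vc.isNotFiltered() && vc.getGenotype("sample.name").isHet() && vc.getGenotype("sample.name").getAD().1 > 5 && vc.getGenotype("sample.name").getGQ() > 30'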

INDELs undetectable in bam file with UnifiedGenotyper


Hello GATK

Is GATK unable to detect INDELs in recalibrated BAM files?
After one round of recalibration, I generated variants from the BAM files with GATK UnifiedGenotyper and samtools mpileup.
However, the VCF files from samtools contain INDELs, but the ones from UG do not.

Here's my command to generate a VCF from the recalibrated BAM files with UG.
Is there anything wrong with my command?
nohup java -jar /opt/GenomeAnalysisTK-3.6/GenomeAnalysisTK.jar -T UnifiedGenotyper -nct 4 -R ~/new_babbler_fa/platanus_trimmed_wNmp_kraken_gapClosed_1000.fa -I babbler10_recall.bam -o babbler10_recal_gatk01.vcf --output_mode EMIT_VARIANTS_ONLY &
Or is there some other reason for this situation?
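A hedged pointer: UnifiedGenotyper's genotype likelihoods model defaults to SNP, so indels are not emitted unless the model is changed. A minimal variant of the command above:

nohup java -jar /opt/GenomeAnalysisTK-3.6/GenomeAnalysisTK.jar -T UnifiedGenotyper \
    -nct 4 -R ~/new_babbler_fa/platanus_trimmed_wNmp_kraken_gapClosed_1000.fa \
    -I babbler10_recall.bam -glm BOTH -o babbler10_recal_gatk01.vcf \
    --output_mode EMIT_VARIANTS_ONLY &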

Can't get variant and invariant sites with HaplotypeCaller


Dear all,

After a careful (I hope) reading of the documentation and the forums, I have reached an impasse and need your help with HC. I want to call variant and invariant sites in barley exome data. However, I can't seem to make HC call all sites. I am getting homozygous and heterozygous (1/1, 0/1) calls but nothing more (no ./. or reference records). I need to get every base position because my ultimate goal is to merge the resulting VCF files and create a dbSNP database. Below are some examples of my (many versions of) commands and some output lines with sites missing. Any help is appreciated!

gatk -T HaplotypeCaller -R ~/scratch/Barley_morex_pseudomolecules_reference/barley_morex_pseudomolecules.fasta -I ./ERR753132_filtered_trimmed.RG.sam.cut.bam -gt_mode DISCOVERY -stand_emit_conf 0.0 -stand_call_conf 0.0 -ip 200 -alleles GENOTYPE_GIVEN_ALLELES -o ERR753132_filtered_trimmed.RG.sam.cut.bam.vcf -ERC BP_RESOLUTION -drf DuplicateRead

gatk -T HaplotypeCaller -R ~/scratch/Barley_morex_pseudomolecules_reference/barley_morex_pseudomolecules.fasta -I ./ERR753132_filtered_trimmed.RG.bam -gt_mode GENOTYPE_GIVEN_ALLELES -stand_emit_conf 0.0 -stand_call_conf 0.0 -ip 200 -minReadsPerAlignStart 0 -alleles GENOTYPE_GIVEN_ALLELES -o ERR753132_filtered_trimmed.RG.bam.g.vcf -ERC BP_RESOLUTION -drf DuplicateRead

chr1H 41926 . T TTCCTC 232.80 . AC=2;AF=1.00;AN=2;DP=7;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=28.20;SOR=3.912 GT:AD:DP:GQ:PL 1/1:0,6:6:18:270,18,0
chr1H 44079 . C T 79.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=-0.619;ClippingRankSum=-0.372;DP=55;ExcessHet=3.0103;FS=4.108;MLEAC=1;MLEAF=0.500;MQ=47.36;MQRankSum=-3.895;QD=1.48;ReadPosRankSum=0.619;SOR=0.200 GT:AD:DP:GQ:PL 0/1:48,6:54:99:108,0,3228
chr1H 44082 . C T 76.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=-0.977;ClippingRankSum=0.041;DP=55;ExcessHet=3.0103;FS=4.108;MLEAC=1;MLEAF=0.500;MQ=47.57;MQRankSum=-3.895;QD=1.42;ReadPosRankSum=0.206;SOR=0.200 GT:AD:DP:GQ:PL 0/1:48,6:54:99:105,0,3249
chr1H 45251 . T C 323.78 . AC=2;AF=1.00;AN=2;DP=9;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=25.00;SOR=3.056 GT:AD:DP:GQ:PL 1/1:0,9:9:27:352,27,0
chr1H 49336 . C G 242.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=0.976;ClippingRankSum=-0.409;DP=23;ExcessHet=3.0103;FS=8.953;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=-0.913;QD=10.56;ReadPosRankSum=-1.543;SOR=3.076 GT:AD:DP:GQ:PL 0/1:14,9:23:99:271,0,459
chr1H 49350 . A T 266.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=-0.724;ClippingRankSum=0.913;DP=23;ExcessHet=3.0103;FS=8.953;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=-0.913;QD=11.60;ReadPosRankSum=-1.984;SOR=3.076 GT:AD:DP:GQ:PL 0/1:14,9:23:99:295,0,488
chr1H 49389 . C A 214.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=0.826;ClippingRankSum=1.016;DP=24;ExcessHet=3.0103;FS=8.860;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=-0.191;QD=8.95;ReadPosRankSum=0.762;SOR=3.126 GT:AD:DP:GQ:PL 0/1:17,7:24:99:243,0,1104
chr1H 49391 . C A 214.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=1.524;ClippingRankSum=-0.191;DP=24;ExcessHet=3.0103;FS=8.860;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.064;QD=8.95;ReadPosRankSum=0.572;SOR=3.126 GT:AD:DP:GQ:PL 0/1:17,7:24:99:243,0,1104
chr1H 49410 . G T 172.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=1.447;ClippingRankSum=1.096;DP=19;ExcessHet=3.0103;FS=5.927;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=-1.973;QD=9.09;ReadPosRankSum=-0.044;SOR=2.584 GT:AD:DP:GQ:PL 0/1:13,6:19:99:201,0,452
chr1H 49442 . G A 115.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=1.134;ClippingRankSum=0.289;DP=12;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=9.65;ReadPosRankSum=0.000;SOR=1.022 GT:AD:DP:GQ:PL 0/1:8,4:12:99:144,0,324

Read Groups without known barcodes


Dear all,

I am working with some barley accessions, but the barcodes for the samples are not known (not publicly available). However, because the addition of read groups is a prerequisite for BQSR and HaplotypeCaller, I wonder if I can use the AddOrReplaceReadGroups command while substituting the barcodes with Ns, e.g. RGPU=NNNNNNNN-NNNNNNNN? Thank you in advance.
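For reference, a minimal AddOrReplaceReadGroups sketch with a placeholder platform unit (all values other than RGPU are hypothetical):

java -jar picard.jar AddOrReplaceReadGroups \
    I=accession1.bam O=accession1.rg.bam \
    RGID=accession1 RGLB=lib1 RGPL=illumina \
    RGPU=NNNNNNNN-NNNNNNNN RGSM=accession1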

HaplotypeCaller outputs only the header and one position record, without error


I'm trying to run gatk4 HaplotypeCaller using the following command:

./gatk HaplotypeCaller -R ./reference.fasta --emit-ref-confidence GVCF --dbsnp ./samtools_gatk_common.vcf -I ./sample.bqsr.bam -O ./sample.gvcf --TMP_DIR ./tmp

The log output gives no errors, but the resulting *.gvcf file contains only the header and one base record. The dbsnp file is the intersection of the samtools and GATK calls.

Here is the log file:

Using GATK jar /path/to/gatk-4.0.4.0/gatk-package-4.0.4.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /path/to/gatk-4.0.4.0/gatk-package-4.0.4.0-local.jar HaplotypeCaller -R /path/to/index/chrom23.fasta --emit-ref-confidence GVCF --dbsnp /path/to/dbsnp/sample.dbsnp.vcf -I /path/to/BQSR/sample.bqsr.bam -O /path/to/result/sample.g.vcf --TMP_DIR /path/to/tmp
18:38:47.051 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/path/to/gatk-4.0.4.0/gatk-package-4.0.4.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
18:38:47.439 INFO  HaplotypeCaller - ------------------------------------------------------------
18:38:47.440 INFO  HaplotypeCaller - The Genome Analysis Toolkit (GATK) v4.0.4.0
18:38:47.440 INFO  HaplotypeCaller - For support and documentation go to https://software.broadinstitute.org/gatk/
18:38:47.442 INFO  HaplotypeCaller - Executing as hankai@cngb-compute-e05-6.cngb.sz.hpc on Linux v2.6.32-696.el6.x86_64 amd64
18:38:47.442 INFO  HaplotypeCaller - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_172-b11
18:38:47.442 INFO  HaplotypeCaller - Start Date/Time: July 4, 2018 6:38:46 PM CST
18:38:47.442 INFO  HaplotypeCaller - ------------------------------------------------------------
18:38:47.442 INFO  HaplotypeCaller - ------------------------------------------------------------
18:38:47.443 INFO  HaplotypeCaller - HTSJDK Version: 2.14.3
18:38:47.443 INFO  HaplotypeCaller - Picard Version: 2.18.2
18:38:47.444 INFO  HaplotypeCaller - HTSJDK Defaults.COMPRESSION_LEVEL : 2
18:38:47.444 INFO  HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
18:38:47.444 INFO  HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
18:38:47.444 INFO  HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
18:38:47.444 INFO  HaplotypeCaller - Deflater: IntelDeflater
18:38:47.444 INFO  HaplotypeCaller - Inflater: IntelInflater
18:38:47.444 INFO  HaplotypeCaller - GCS max retries/reopens: 20
18:38:47.444 INFO  HaplotypeCaller - Using google-cloud-java patch 6d11bef1c81f885c26b2b56c8616b7a705171e4f from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
18:38:47.444 INFO  HaplotypeCaller - Initializing engine
18:38:50.210 INFO  FeatureManager - Using codec VCFCodec to read file file:///path/to/dbsnp/sample.dbsnp.vcf
18:38:50.292 INFO  HaplotypeCaller - Done initializing engine
18:38:50.303 INFO  HaplotypeCallerEngine - Standard Emitting and Calling confidence set to 0.0 for reference-model confidence output
18:38:50.303 INFO  HaplotypeCallerEngine - All sites annotated with PLs forced to true for reference-model confidence output
18:38:51.794 INFO  NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/path/to/gatk-4.0.4.0/gatk-package-4.0.4.0-local.jar!/com/intel/gkl/native/libgkl_utils.so
18:38:51.817 INFO  NativeLibraryLoader - Loading libgkl_pairhmm_omp.so from jar:file:/path/to/gatk-4.0.4.0/gatk-package-4.0.4.0-local.jar!/com/intel/gkl/native/libgkl_pairhmm_omp.so
18:38:51.915 WARN  IntelPairHmm - Flush-to-zero (FTZ) is enabled when running PairHMM
18:38:51.916 INFO  IntelPairHmm - Available threads: 112
18:38:51.916 INFO  IntelPairHmm - Requested threads: 4
18:38:51.916 INFO  PairHMM - Using the OpenMP multi-threaded AVX-accelerated native PairHMM implementation
18:38:51.996 INFO  ProgressMeter - Starting traversal
18:38:51.997 INFO  ProgressMeter -        Current Locus  Elapsed Minutes     Regions Processed   Regions/Minute
18:39:02.152 INFO  ProgressMeter - pseudochrom_23:39888              0.2                   240           1418.4
[... ProgressMeter lines elided: traversal advanced from pseudochrom_23:112351 to pseudochrom_23:13779661 over the next ~24 minutes ...]
19:03:19.416 INFO  ProgressMeter - pseudochrom_23:13820635             24.5                 83330           3407.2
19:03:29.183 INFO  HaplotypeCaller - 55869059 read(s) filtered by: ((((((((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappedReadFilter) AND NotSecondaryAlignmentReadFilter) AND NotDuplicateReadFilter) AND PassesVendorQualityCheckReadFilter) AND NonZeroReferenceLengthAlignmentReadFilter) AND GoodCigarReadFilter) AND WellformedReadFilter)
  55869059 read(s) filtered by: (((((((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappedReadFilter) AND NotSecondaryAlignmentReadFilter) AND NotDuplicateReadFilter) AND PassesVendorQualityCheckReadFilter) AND NonZeroReferenceLengthAlignmentReadFilter) AND GoodCigarReadFilter)
      55869059 read(s) filtered by: ((((((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappedReadFilter) AND NotSecondaryAlignmentReadFilter) AND NotDuplicateReadFilter) AND PassesVendorQualityCheckReadFilter) AND NonZeroReferenceLengthAlignmentReadFilter)
          55869059 read(s) filtered by: (((((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappedReadFilter) AND NotSecondaryAlignmentReadFilter) AND NotDuplicateReadFilter) AND PassesVendorQualityCheckReadFilter)
              55869059 read(s) filtered by: ((((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappedReadFilter) AND NotSecondaryAlignmentReadFilter) AND NotDuplicateReadFilter)
                  47376329 read(s) filtered by: (((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappedReadFilter) AND NotSecondaryAlignmentReadFilter)
                      46853127 read(s) filtered by: ((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappedReadFilter)
                          46853127 read(s) filtered by: (MappingQualityReadFilter AND MappingQualityAvailableReadFilter)
                              46853127 read(s) filtered by: MappingQualityReadFilter 
                      523202 read(s) filtered by: NotSecondaryAlignmentReadFilter 
                  8492730 read(s) filtered by: NotDuplicateReadFilter 

19:03:29.184 INFO  ProgressMeter - pseudochrom_23:13859898             24.6                 83586           3395.1
19:03:29.184 INFO  ProgressMeter - Traversal complete. Processed 83586 total regions in 24.6 minutes.
19:03:30.381 INFO  VectorLoglessPairHMM - Time spent in setup for JNI call : 0.0
19:03:30.381 INFO  PairHMM - Total compute time in PairHMM computeLogLikelihoods() : 0.0
19:03:30.381 INFO  SmithWatermanAligner - Total compute time in java Smith-Waterman : 0.00 sec
19:03:30.381 INFO  HaplotypeCaller - Shutting down engine
[July 4, 2018 7:03:30 PM CST] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 24.73 minutes.
Runtime.totalMemory()=372873625

And the resulting *.gvcf:

##fileformat=VCFv4.2
##ALT=<ID=NON_REF,Description="Represents any possible alternative allele at this location">
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
##FORMAT=<ID=PGT,Number=1,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another">
##FORMAT=<ID=PID,Number=1,Type=String,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">
##GATKCommandLine=<ID=HaplotypeCaller,CommandLine="HaplotypeCaller  --dbsnp /path/to/dbsnp/sample.dbsnp.vcf --emit-ref-confidence GVCF --output /path/to/result/sample.g.vcf --input /path/to/BQSR/sample.bqsr.bam --reference /path/to/index/chrom23.fasta --TMP_DIR /path/to/tmp  --annotation-group StandardAnnotation --annotation-group StandardHCAnnotation --disable-tool-default-annotations false --gvcf-gq-bands 1 --gvcf-gq-bands 2 --gvcf-gq-bands 3 --gvcf-gq-bands 4 --gvcf-gq-bands 5 --gvcf-gq-bands 6 --gvcf-gq-bands 7 --gvcf-gq-bands 8 --gvcf-gq-bands 9 --gvcf-gq-bands 10 --gvcf-gq-bands 11 --gvcf-gq-bands 12 --gvcf-gq-bands 13 --gvcf-gq-bands 14 --gvcf-gq-bands 15 --gvcf-gq-bands 16 --gvcf-gq-bands 17 --gvcf-gq-bands 18 --gvcf-gq-bands 19 --gvcf-gq-bands 20 --gvcf-gq-bands 21 --gvcf-gq-bands 22 --gvcf-gq-bands 23 --gvcf-gq-bands 24 --gvcf-gq-bands 25 --gvcf-gq-bands 26 --gvcf-gq-bands 27 --gvcf-gq-bands 28 --gvcf-gq-bands 29 --gvcf-gq-bands 30 --gvcf-gq-bands 31 --gvcf-gq-bands 32 --gvcf-gq-bands 33 --gvcf-gq-bands 34 --gvcf-gq-bands 35 --gvcf-gq-bands 36 --gvcf-gq-bands 37 --gvcf-gq-bands 38 --gvcf-gq-bands 39 --gvcf-gq-bands 40 --gvcf-gq-bands 41 --gvcf-gq-bands 42 --gvcf-gq-bands 43 --gvcf-gq-bands 44 --gvcf-gq-bands 45 --gvcf-gq-bands 46 --gvcf-gq-bands 47 --gvcf-gq-bands 48 --gvcf-gq-bands 49 --gvcf-gq-bands 50 --gvcf-gq-bands 51 --gvcf-gq-bands 52 --gvcf-gq-bands 53 --gvcf-gq-bands 54 --gvcf-gq-bands 55 --gvcf-gq-bands 56 --gvcf-gq-bands 57 --gvcf-gq-bands 58 --gvcf-gq-bands 59 --gvcf-gq-bands 60 --gvcf-gq-bands 70 --gvcf-gq-bands 80 --gvcf-gq-bands 90 --gvcf-gq-bands 99 --indel-size-to-eliminate-in-ref-model 10 --use-alleles-trigger false --disable-optimizations false --just-determine-active-regions false --dont-genotype false --dont-trim-active-regions false --max-disc-ar-extension 25 --max-gga-ar-extension 300 --padding-around-indels 150 --padding-around-snps 20 --kmer-size 10 --kmer-size 25 --dont-increase-kmer-sizes-for-cycles false --allow-non-unique-kmers-in-ref false --num-pruning-samples 1 --recover-dangling-heads false --do-not-recover-dangling-branches false --min-dangling-branch-length 4 --consensus false --max-num-haplotypes-in-population 128 --error-correct-kmers false --min-pruning 2 --debug-graph-transformations false --kmer-length-for-read-error-correction 25 --min-observations-for-kmer-to-be-solid 20 --likelihood-calculation-engine PairHMM --base-quality-score-threshold 18 --pair-hmm-gap-continuation-penalty 10 --pair-hmm-implementation FASTEST_AVAILABLE --pcr-indel-model CONSERVATIVE --phred-scaled-global-read-mismapping-rate 45 --native-pair-hmm-threads 4 --native-pair-hmm-use-double-precision false --debug false --use-filtered-reads-for-annotations false --bam-writer-type CALLED_HAPLOTYPES --dont-use-soft-clipped-bases false --capture-assembly-failure-bam false --error-correct-reads false --do-not-run-physical-phasing false --min-base-quality-score 10 --smith-waterman JAVA --use-new-qual-calculator false --annotate-with-num-discovered-alleles false --heterozygosity 0.001 --indel-heterozygosity 1.25E-4 --heterozygosity-stdev 0.01 --standard-min-confidence-threshold-for-calling 10.0 --max-alternate-alleles 6 --max-genotype-count 1024 --sample-ploidy 2 --genotyping-mode DISCOVERY --genotype-filtered-alleles false --contamination-fraction-to-filter 0.0 --output-mode EMIT_VARIANTS_ONLY --all-site-pls false --min-assembly-region-size 50 --max-assembly-region-size 300 --assembly-region-padding 100 --max-reads-per-alignment-start 50 --active-probability-threshold 0.002 
--max-prob-propagation-distance 50 --interval-set-rule UNION --interval-padding 0 --interval-exclusion-padding 0 --interval-merging-rule ALL --read-validation-stringency SILENT --seconds-between-progress-updates 10.0 --disable-sequence-dictionary-validation false --create-output-bam-index true --create-output-bam-md5 false --create-output-variant-index true --create-output-variant-md5 false --lenient false --add-output-sam-program-record true --add-output-vcf-command-line true --cloud-prefetch-buffer 40 --cloud-index-prefetch-buffer -1 --disable-bam-index-caching false --help false --version false --showHidden false --verbosity INFO --QUIET false --use-jdk-deflater false --use-jdk-inflater false --gcs-max-retries 20 --disable-tool-default-read-filters false --minimum-mapping-quality 20",Version=4.0.4.0,Date="July 4, 2018 6:38:51 PM CST">
##GVCFBlock0-1=minGQ=0(inclusive),maxGQ=1(exclusive)
[... ##GVCFBlock header lines for the intermediate GQ bands (1 through 99) elided ...]
##GVCFBlock99-100=minGQ=99(inclusive),maxGQ=100(exclusive)
##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##INFO=<ID=ClippingRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP Membership">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
##INFO=<ID=ExcessHet,Number=1,Type=Float,Description="Phred-scaled p-value for exact test of excess heterozygosity">
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
##INFO=<ID=RAW_MQ,Number=1,Type=Float,Description="Raw data for RMS Mapping Quality">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
##contig=<ID=pseudochrom_23,length=13860564>
##source=HaplotypeCaller
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  CL100020307_L01_17
pseudochrom_23  1   .   A   <NON_REF>   .   .   END=13860564    GT:DP:GQ:MIN_DP:PL  0/0:0:0:0:0,0,0

I don't know if it's reasonable to suppose that there must be some variation, as the dbSNP VCF file contains 11733 variants. Even if there were no variation, HaplotypeCaller should still output reference records like the one at position 1, but there is nothing else.

GATK 4.0.5 and "fix_misencoded_quality_scores"


Switching from GATK 3.8 to 4.0.5 to avail myself of '--genotype-filtered-alleles', I discovered a few format changes to familiar options.

But '--fix_misencoded_quality_scores' is no longer recognized.

My question is what does HC (gatk 4) do when it encounters the problems that were previously handled with the '--fix_misencoded_quality_scores' option?
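
In the meantime I've been pre-fixing the affected BAMs in a separate step. Here is a minimal sketch of what I'm trying, assuming the standalone FixMisencodedBaseQualityReads tool is the intended GATK4 replacement (file names are hypothetical):

# rescale Illumina-1.3/1.5 (phred+64) base qualities to phred+33
gatk FixMisencodedBaseQualityReads \
    -I sample.phred64.bam \
    -O sample.phred33.bam

If that is the recommended route, I'd still like to know whether HC detects misencoded qualities on its own.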

Cheers,
Chuck

(How to part I) Sensitively detect copy ratio alterations and allelic segments


Document is currently under review and in BETA. It is incomplete and may contain inaccuracies. Expect changes to the content.


This workflow is broken into two tutorials. You are currently on the first part.

The tutorial outlines steps in detecting copy ratio alterations, more familiarly copy number variants (CNVs), as well as allelic segments in a single sample using GATK4. The tutorial (i) denoises case sample alignment data against a panel of normals (PoN) to obtain copy ratios (Tutorial#11682) and (ii) models segments from the copy ratios and allelic counts (Tutorial#11683). The latter modeling incorporates data from a matched control. The same workflow steps apply to targeted exome and whole genome sequencing data.

Tutorial#11682 covers sections 1–4. Section 1 prepares a genomic intervals list with PreprocessIntervals and collects read coverage counts across the intervals. Section 2 creates a CNV PoN with CreateReadCountPanelOfNormals using read coverage counts. Section 3 denoises read coverage data against the PoN with DenoiseReadCounts using principal component analysis. Section 4 plots the results of standardizing and denoising copy ratios against the PoN.

Tutorial#11683 covers sections 5–8. Section 5 collects counts of reference versus alternate alleles with CollectAllelicCounts. Section 6 incorporates copy ratio and allelic counts data to group contiguous copy ratio and allelic counts segments with ModelSegments using kernel segmentation and Markov-chain Monte Carlo. The tool can also segment either copy ratio data or allelic counts data alone. Both types of data together refine segmentation results in that segments are based on the same copy ratio and the same minor allele fraction. Section 7 calls amplification, deletion and neutral events for the segmented copy ratios. Finally, Section 8 plots the results of segmentation and estimated allele-specific copy ratios.

Plotting is across genomic loci on the x-axis and copy or allelic ratios on the y-axis. The first part of the workflow focuses on removing systematic noise from coverage counts and adjusts the data points vertically. The second part focuses on segmentation and groups the data points horizontally. The extent of grouping, or smoothing, is adjustable with ModelSegments parameters. These adjustments do not change the copy ratios; the denoising in the first part of the workflow remains invariant in the second part of the workflow. See Figure 3 of this poster for a summary of tutorial results.

► The official GATK4 workflow is capable of running efficiently on WGS data and provides much greater resolution, up to ~50-fold more for tested data. In these ways, GATK4 CNV improves upon its predecessor workflows in GATK4.alpha and GATK4.beta. Validations are still in progress and therefore the workflow itself is in BETA status, even though most tools, with the exception of ModelSegments, are production ready. The ModelSegments tool is still in BETA status and may change in small but significant ways going forward. Use at your own risk.

► The tutorial skips explicit GC-correction, an option in CNV analysis. For instructions on how to correct for GC bias, see AnnotateIntervals and DenoiseReadCounts tool documentation.

The GATK4 CNV workflow offers a multitude of levers, e.g. for fine-tuning analyses and for setting up controls. Researchers are expected to tune workflow parameters on samples with copy number profiles similar to those of the case sample under scrutiny. Refer to each tool's documentation for descriptions of parameters.


Jump to a section

  1. Collect raw counts data with PreprocessIntervals and CollectFragmentCounts
    1.1 How do I view HDF5 format data?
  2. Generate a CNV panel of normals with CreateReadCountPanelOfNormals
  3. Standardize and denoise case read counts against the PoN with DenoiseReadCounts
  4. Plot standardized and denoised copy ratios with PlotDenoisedCopyRatios
    4.1 Compare two PoNs: considerations in panel of normals creation
    4.2 Compare PoN denoising versus matched-normal denoising
  5. Count ref and alt alleles at common germline variant sites using CollectAllelicCounts
    5.1 What is the difference between CollectAllelicCounts and GetPileupSummaries?
  6. Group contiguous copy ratios into segments with ModelSegments
  7. Call copy-neutral, amplified and deleted segments with CallCopyRatioSegments
  8. Plot modeled segments and allelic copy ratios with PlotModeledSegments
    8.1 Some considerations in interpreting allelic copy ratios
    8.2 Some results of fine-tuning smoothing parameters

Tools involved

  • GATK 4.0.1.1 or later releases.
  • The plotting tools require particular R components. Options are to install these or to use the broadinstitute/gatk Docker. In particular, to match versions, use the broadinstitute/gatk:4.0.1.1 version.

Download example data

Download tutorial_11682.tar.gz and tutorial_11683.tar.gz, either from the GoogleDrive or from the FTP site. To access the ftp site, leave the password field blank. If the GoogleDrive link is broken, please let us know. The tutorial also requires the GRCh38 reference FASTA, dictionary and index. These are available from the GATK Resource Bundle. For details on the example data, see Tutorial#11136's third footnote and [1].

Alternatively, download the spacecade7/tutorial_11682_11683 docker image from DockerHub. The image contains GATK4.0.1.1 and the data necessary to run the tutorial commands, including the GRCh38 reference. Allocation of at least 4GB memory to Docker is recommended before launching the container.


1. Collect raw counts data with PreprocessIntervals and CollectFragmentCounts

Before collecting the coverage counts that form the basis of copy number variant detection, we define the resolution of the analysis with a genomic intervals list. The extent of genomic coverage and the size of genomic intervals in the intervals list factor towards resolution.

Preparing a genomic intervals list is necessary whether an analysis is on targeted exome data or whole genome data. In the case of exome data, we pad the target regions of the capture kit. In the case of whole genome data, we divide the reference genome into equally sized intervals or bins. In either case, we use PreprocessIntervals to prepare the intervals list.

For the tutorial exome data, we provide the capture kit target regions in 1-based intervals and set --bin-length to zero.

gatk PreprocessIntervals \
    -L targets_C.interval_list \
    -R /gatk/ref/Homo_sapiens_assembly38.fasta \
    --bin-length 0 \
    --interval-merging-rule OVERLAPPING_ONLY \
    -O sandbox/targets_C.preprocessed.interval_list

This produces a Picard-style intervals list targets_C.preprocessed.interval_list for use in the coverage collection step. Each interval is expanded by 250 bases on either side.

Comments on select parameters

  • The -L argument is optional. If provided, the tool expects the intervals list to be in Picard-style format as described in Article#1319; the tool errors out for other formats. If this argument is omitted, then the tool assumes each contig is a single interval. See [2] for additional discussion.
  • Set the --bin-length argument to be appropriate for the type of data, e.g. default 1000 for whole genome or 0 for exomes. In binning, an interval is divided into equal-sized regions of the specified length. The tool does not bin regions that contain Ns. [3]
  • Set --interval-merging-rule to OVERLAPPING_ONLY, to prevent the tool from merging abutting intervals. [4]
  • The --reference or -R is required and implies the presence of a corresponding reference index and a reference dictionary in the same directory.
  • To change the padding interval, specify the new value with --padding. The default value of 250 bases was determined to work well empirically for TCGA targeted exome data. This argument is relevant for exome data, as binning without an intervals list does not allow for intervals expansion. [5]
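
For contrast, a hypothetical whole-genome invocation relies on binning instead of padding; with no -L list, the tool bins each contig (output name here is illustrative):

gatk PreprocessIntervals \
    -R /gatk/ref/Homo_sapiens_assembly38.fasta \
    --bin-length 1000 \
    --padding 0 \
    --interval-merging-rule OVERLAPPING_ONLY \
    -O sandbox/wgs.preprocessed.interval_list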

Take a look at the intervals before and after padding.

cnv_intervals

For consecutive intervals less than 250 bases apart, how does the tool pad the intervals?

Now collect raw integer counts data. The tutorial uses GATK4.0.1.1's CollectFragmentCounts, which counts coverage of paired-end fragments. The tool counts a fragment once, at the interval overlapping the fragment's center. In GATK4.0.3.0, CollectReadCounts replaces CollectFragmentCounts. CollectReadCounts counts reads that overlap the interval.

The tutorial has already collected coverage on the tumor case sample, on the normal matched-control and on each of the normal samples that constitute the PoN. To demonstrate coverage collection, the following command uses the small BAM from Tutorial#11136’s data bundle [6]. The tutorial does not use the resulting file in subsequent steps. The CollectReadCounts command swaps out the tool name but otherwise uses identical parameters.

gatk CollectFragmentCounts \
    -I tumor.bam \
    -L targets_C.preprocessed.interval_list \
    --interval-merging-rule OVERLAPPING_ONLY \
    -O sandbox/tumor.counts.hdf5

In the tutorial data bundle, the equivalent full-length result is hcc1143_T_clean.counts.hdf5. The data tabulates CONTIG, START, END and raw COUNT values for each genomic interval.

Comments on select parameters

  • The -L argument interval list is a Picard-style interval list prepared with PreprocessIntervals.
  • The -I input is alignment data.
  • By default, data is in HDF5 format. To generate text-based TSV (tab-separated values) format data, specify --format TSV. The HDF5 format allows for quicker panel of normals creation.
  • Set --interval-merging-rule to OVERLAPPING_ONLY, to prevent the tool from merging abutting intervals. [4]
  • The tool employs a number of engine-level read filters. Of note are NotDuplicateReadFilter, FirstOfPairReadFilter, ProperlyPairedReadFilter and MappingQualityReadFilter. [7]
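
As an example of the TSV option, here is a sketch of the equivalent CollectReadCounts invocation (GATK4.0.3.0 onward) with text output; the tutorial itself keeps the HDF5 default:

gatk CollectReadCounts \
    -I tumor.bam \
    -L targets_C.preprocessed.interval_list \
    --interval-merging-rule OVERLAPPING_ONLY \
    --format TSV \
    -O sandbox/tumor.counts.tsv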

☞ 1.1 How do I view HDF5 format data?

See Article#11508 for an overview of the format and instructions on how to navigate the data with external application HDFView. The article illustrates features of the format using data generated in this tutorial.
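
For a quick look from the command line, the standard HDF5 utilities can also navigate these files, assuming they are installed; dataset paths vary by tool version, so list the layout before dumping (the path below is illustrative):

# list every group and dataset in the counts file
h5ls -r hcc1143_T_clean.counts.hdf5

# dump one dataset once its path is known
h5dump -d /counts/values hcc1143_T_clean.counts.hdf5 | head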


back to top


2. Generate a CNV panel of normals with CreateReadCountPanelOfNormals

In creating a PoN, CreateReadCountPanelOfNormals abstracts the counts data for the samples and the intervals using Singular Value Decomposition (SVD, 1), a type of Principal Component Analysis (PCA, 1, 2, 3). The normal samples in the PoN should match the sequencing approach of the case sample under scrutiny. This applies especially to targeted exome data because the capture step introduces target-specific noise.

The tutorial has already created a CNV panel of normals using forty 1000 Genomes Project samples. The command below illustrates PoN creation using just three samples.

gatk --java-options "-Xmx6500m" CreateReadCountPanelOfNormals \
    -I HG00133.alt_bwamem_GRCh38DH.20150826.GBR.exome.counts.hdf5 \
    -I HG00733.alt_bwamem_GRCh38DH.20150826.PUR.exome.counts.hdf5 \
    -I NA19654.alt_bwamem_GRCh38DH.20150826.MXL.exome.counts.hdf5 \
    --minimum-interval-median-percentile 5.0 \
    -O sandbox/cnvponC.pon.hdf5

This generates a PoN in HDF5 format. The PoN stores information that, when applied, will (i) standardize case sample counts to PoN median counts and (ii) remove systematic noise in the case sample.

Comments on select parameters

  • Provide integer read coverage counts for each sample using -I. Coverage data may be in either TSV or HDF5 format. The tool will accept a single sample, e.g. the matched-normal.
  • The default --number-of-eigensamples or principal components is twenty. The tool will adjust this number to the smaller of twenty or the number of samples the tool retains after filtering. In general, denoising against a PoN with more components improves segmentation, but at the expense of sensitivity. Ideally, researchers should perform a sensitivity analysis to choose an appropriate value for this parameter. See this related discussion.
  • To run the tool using Spark, specify the Spark Master with --spark-master. See Article#11245 for details.

Comments on filtering and imputation parameters, in the order of application

  1. The tutorial changes the --minimum-interval-median-percentile argument from the default of 10.0 to a smaller value of 5.0. The tool filters out targets or bins with a median proportional coverage below this percentile. The median is across the samples. The proportional coverage is the target coverage divided by the sum of the coverage of all targets for a sample. The effect of setting this parameter to a smaller value is that we retain more information.
  2. The --maximum-zeros-in-sample-percentage default is 5.0. Any sample with more than 5% zero coverage targets is filtered.
  3. The --maximum-zeros-in-interval-percentage default is 5.0. Any target interval with more than 5% zero coverage across samples is filtered.
  4. The --extreme-sample-median-percentile default is 2.5. Any sample with less than 2.5 percentile or more than 97.5 percentile normalized median proportional coverage is filtered.
  5. The --do-impute-zeros default is set to true. The tool takes zero coverage regions and changes these values to the median of the non-zero values. The tool additionally normalizes zero values below the 0.10 percentile or above the 99.90 percentile to the corresponding percentile values.
  6. The --extreme-outlier-truncation-percentile default is 0.1. The tool takes any proportional coverage below the 0.1 percentile or above the 99.9 percentile and sets it to the corresponding percentile value.

The current filtering and imputation parameters are identical to those in the BETA release of the CNV workflow and may change in later versions based on evaluations. The implementation has been made more memory efficient, so the tool runs faster than the BETA release.

If the data are not uniform, e.g. have many intervals with zero or low counts, the tool emits a warning to adjust filtering parameters and stops the run. This may happen, for example, if one attempts to construct a panel from mixed-sex samples and includes the allosomal contigs [8]. In this case, first be sure either to exclude allosomal contigs via a subset intervals list or to subset the panel samples to those expected to have similar coverage across the given contigs, e.g. panels of the same sex. If the warning still occurs, then adjust --minimum-interval-median-percentile to a larger value. See this thread for the original discussion.
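
A sketch of excluding allosomal contigs at the count-collection step, assuming GRCh38 contig names and using the engine-level -XL exclusion argument (file names are illustrative):

gatk CollectFragmentCounts \
    -I normal.bam \
    -L targets_C.preprocessed.interval_list \
    -XL chrX -XL chrY \
    --interval-merging-rule OVERLAPPING_ONLY \
    -O sandbox/normal.autosomes.counts.hdf5

Collect the case-sample counts over the same subset so that the panel and case intervals match.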

Based on what you know about PCA, what do you think are the effects of using more normal samples? A panel with some profiles that are outliers? Could PCA account for GC-bias?
What do you know about the 1000 Genome Project? Specifically, the exome data?
How could we tell a good PoN from a bad PoN? What control could we use?

In a somatic analysis, what is better for a PoN: tissue-matched normals or blood normals?
Should we include our particular tumor’s matched normal in the PoN?


back to top


3. Standardize and denoise case read counts against the PoN with DenoiseReadCounts

Provide DenoiseReadCounts with counts collected by CollectFragmentCounts and the CNV PoN generated with CreateReadCountPanelOfNormals.

gatk --java-options "-Xmx12g" DenoiseReadCounts \
    -I hcc1143_T_clean.counts.hdf5 \
    --count-panel-of-normals cnvponC.pon.hdf5 \
    --standardized-copy-ratios sandbox/hcc1143_T_clean.standardizedCR.tsv \
    --denoised-copy-ratios sandbox/hcc1143_T_clean.denoisedCR.tsv

This produces two files, the standardized copy ratios hcc1143_T_clean.standardizedCR.tsv and the denoised copy ratios hcc1143_T_clean.denoisedCR.tsv, each of which represents a data transformation. In the first transformation, the tool standardizes counts by the PoN median counts. The standardization includes log2 transformation and normalizing the counts data to center around one. In the second transformation, the tool denoises the standardized copy ratios using the principal components of the PoN.

Comments on select parameters

  • Because the default --number-of-eigensamples is null, the tool uses the maximum number of eigensamples available in the PoN. In section 2, by using default CreateReadCountPanelOfNormals parameters, we capped the number of eigensamples in the PoN at twenty. Changing --number-of-eigensamples in DenoiseReadCounts to lower values can change the resolution of results, i.e. how smooth segments are. See this thread for detailed discussion.
  • Additionally provide the optional --annotated-intervals generated by AnnotateIntervals to concurrently perform GC-bias correction.
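
A sketch of that GC-correction route follows. It assumes that CreateReadCountPanelOfNormals and DenoiseReadCounts both accept --annotated-intervals, that a PoN used with annotations must itself have been built with the same annotations, and that the input and output names are illustrative:

# annotate GC content over the preprocessed intervals
gatk AnnotateIntervals \
    -L targets_C.preprocessed.interval_list \
    -R /gatk/ref/Homo_sapiens_assembly38.fasta \
    --interval-merging-rule OVERLAPPING_ONLY \
    -O sandbox/targets_C.annotated.tsv

# build the PoN with the same annotations
gatk CreateReadCountPanelOfNormals \
    -I normal_1.counts.hdf5 \
    -I normal_2.counts.hdf5 \
    --annotated-intervals sandbox/targets_C.annotated.tsv \
    -O sandbox/cnvpon.gc.pon.hdf5

# denoise with both the PoN and the annotations
gatk DenoiseReadCounts \
    -I hcc1143_T_clean.counts.hdf5 \
    --count-panel-of-normals sandbox/cnvpon.gc.pon.hdf5 \
    --annotated-intervals sandbox/targets_C.annotated.tsv \
    --standardized-copy-ratios sandbox/hcc1143_T_clean.gc.standardizedCR.tsv \
    --denoised-copy-ratios sandbox/hcc1143_T_clean.gc.denoisedCR.tsv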


back to top


4. Plot standardized and denoised copy ratios with PlotDenoisedCopyRatios

We plot the standardized and denoised read counts with PlotDenoisedCopyRatios. The plots allow visually assessing the efficacy of denoising. Provide the tool with both the standardized and denoised copy ratios from the previous step as well as a reference sequence dictionary.

gatk PlotDenoisedCopyRatios \
    --standardized-copy-ratios hcc1143_T_clean.standardizedCR.tsv \
    --denoised-copy-ratios hcc1143_T_clean.denoisedCR.tsv \
    --sequence-dictionary Homo_sapiens_assembly38.dict \
    --minimum-contig-length 46709983 \
    --output sandbox/plots \
    --output-prefix hcc1143_T_clean

This produces six files in the plots directory--two PNG images and four text files as follows.

  • hcc1143_T_clean.denoised.png plots the standardized and denoised read counts across the contigs and scales the y-axis to accommodate all copy ratio data.
  • hcc1143_T_clean.denoisedLimit4.png plots the same but limits the y-axis range from 0 to 4 for comparability across samples.

Each of the text files contains a single quality control value. The value is the median of absolute differences (MAD) in copy-ratios of adjacent targets. Its calculation is robust to actual copy-number events and should decrease after denoising.

  • hcc1143_T_clean.standardizedMAD.txt gives the MAD for standardized copy ratios.
  • hcc1143_T_clean.denoisedMAD.txt gives the MAD for denoised copy ratios.
  • hcc1143_T_clean.deltaMAD.txt gives the difference between standardized MAD and denoised MAD.
  • hcc1143_T_clean.scaledDeltaMAD.txt gives the fractional difference (standardized MAD - denoised MAD)/(standardized MAD).
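
As a quick worked check, the scaled value can be recomputed from the first two files, assuming each file holds just the bare number:

# (standardized MAD - denoised MAD) / standardized MAD
paste hcc1143_T_clean.standardizedMAD.txt hcc1143_T_clean.denoisedMAD.txt \
    | awk '{ print ($1 - $2) / $1 }'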

Comments on select parameters

  • The tutorial provides the --sequence-dictionary that matches the GRCh38 reference used in mapping.
  • To omit alternate and decoy contigs from the plots, the tutorial adjusts the --minimum-contig-length from the default value of 1,000,000 to 46,709,983, the length of the smallest of GRCh38's primary assembly contigs.
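
A rough way to confirm that threshold from the dictionary itself, assuming GRCh38 primary-assembly contigs are those without an underscore in their names, and excluding chrM and chrEBV:

grep '^@SQ' Homo_sapiens_assembly38.dict \
    | grep -vE 'chrM|chrEBV|_' \
    | sed 's/.*LN://; s/[[:space:]].*//' \
    | sort -n | head -1

This should print 46709983, the length of chr21.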

Here are the results for the HCC1143 tumor cell line and its matched normal cell line. The normal cell line serves as a control. For each sample are two plots that show the effects of PCA denoising. The upper plot shows standardized copy ratios in blue and the lower plot shows denoised copy ratios in green.

4A. Tumor standardized and denoised copy ratio plots
hcc1143_T_clean.denoisedLimit4.png

4B. Normal standardized and denoised copy ratio plots
hcc1143_N_clean.denoisedLimit4.png

Would you guess there are CNV events in the normal? Should we be surprised?

The next step is to perform segmentation. This can be done either using copy ratios alone or in combination with allelic copy ratios. In part II, section 6 outlines considerations in modeling segments with allelic copy ratios, section 7 generates a callset and section 8 shows how to plot segmented copy and allelic ratios. Again, the tutorial presents these steps using the full features of the workflow. However, researchers may desire to perform copy ratio segmentation independently of allelic counts data, e.g. for a case without a matched control. For the case-only approach, segmentation gives the following plots. To recapitulate this approach, omit the allelic-counts parameters from the example commands in sections 6 and 8.

4C. Tumor case-only copy ratios segmentation gives 235 segments.
T_caseonly.modeled.png

4D. Normal case-only copy ratios segmentation gives 41 segments.
hcc1143_N_caseonly.png

While the normal sample shows trisomy of chr2 and a subpopulation with deletion of chr6, the tumor sample is highly aberrant. The extent of aneuploidy is unsurprising and consistent with these HCC1143 tumor dSKY results by Wenhan Chen. Remember that cell lines, with increasing culture time and selective bottlenecks, can give rise to new somatic events, undergo clonal selection and develop population heterogeneity much like in cancer.


☞ 4.1 Compare two PoNs: considerations in panel of normals creation

Denoising with a PoN is critical for calling copy-number variants from targeted exome coverage profiles. It can also improve calls from WGS profiles that are typically more evenly distributed and subject to less noise. Furthermore, denoising with a PoN can greatly impact results for (i) samples that have more noise, e.g. those with lower coverage, lower purity or higher activity, (ii) samples lacking a matched normal and (iii) detection of smaller events that span only a few targets.

To understand the impact a PoN's constituents can have on an analysis, compare the results of denoising the normal sample against two different PoNs. Each PoN consists of forty 1000 Genomes Project exome samples. PoN-M consists of the same cohort used in the Mutect2 tutorial's PoN. We selected PoN-C's constituents with more care and this is the PoN the CNV tutorial uses.

4E. Compare standardization and denoising with PoN-C versus PoN-M.
compare_pons.png

What is the difference in the targets for the two cohorts--cohort-M and cohort-C? Is this a sufficient reason for the difference in noise profiles we observe above?

GATK4 denoises exome coverage profiles robustly with either panel of normals. However, a good panel allows maximal denoising, as is the case for PoN-C over PoN-M.

We use publicly available 1000 Genomes Project data so as to be able to share the data and to illustrate considerations in CNV analyses. In an actual somatic analysis, we would construct the PoNs using the blood normals of the tumor cohort(s). We would construct a PoN for each sex, so as to be able to call events on allosomal chromosomes. Such a PoN should give better results than either of the tutorial PoNs.

Somatic analyses, due to the confounding factors of tumor purity and heterogeneity, require high sensitivity in calling. However, a sensitive caller can only do so much. Use of a carefully constructed PoN augments the sensitivity and helps illuminate copy number events.

This section is adapted from a hands-on tutorial developed and written by Soo Hee Lee (@shlee) in July of 2017 for the GATK workshops in Cambridge and Edinburgh, UK. The original tutorial uses the GATK4.beta workflow and can be found in the 1707 through 1711 GATK workshops folders. Although the Somatic CNV workflow has changed from GATK4.beta and the official GATK4 release, the PCA denoising remains the same. The hands-on tutorial focuses on differences in PCA denoising based on two different panels of normals (PoNs). Researchers may find working through the worksheet to the very end with either release version beneficial, as considerations in selecting PoN constituents remain identical.

Examining the read group information for the samples in the two PoNs shows a difference in mixtures of sequencing centers--four different sequencing centers for PoN-M versus a single sequencing center for PoN-C. The single sequencing center corresponds to that of the HCC1143 samples. Furthermore, tracing sample information will show different targeted exome capture kits for the sequencing centers. Comparing the denoising results of the two PoNs stresses the importance of selective PoN creation.


☞ 4.2 Compare PoN denoising versus matched-normal denoising

A feature of the GATK4 CNV workflow is the ability to normalize a case against a single control sample, e.g. a tumor case against its matched normal. This involves running the control sample through CreateReadCountPanelOfNormals, then denoising the case against this single-sample projection with DenoiseReadCounts. To illustrate this approach, here is the result of denoising the HCC1143 tumor sample against its matched normal. For single-sample matched-control denoising, DenoiseReadCounts produces identical data for standardizedCR.tsv and denoisedCR.tsv.
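
A sketch of this matched-control approach, assuming the normal's counts file follows the tutorial bundle's naming:

# build a single-sample projection from the matched normal
gatk CreateReadCountPanelOfNormals \
    -I hcc1143_N_clean.counts.hdf5 \
    -O sandbox/matched_normal.pon.hdf5

# denoise the tumor case against it
gatk --java-options "-Xmx12g" DenoiseReadCounts \
    -I hcc1143_T_clean.counts.hdf5 \
    --count-panel-of-normals sandbox/matched_normal.pon.hdf5 \
    --standardized-copy-ratios sandbox/hcc1143_T_clean.matched.standardizedCR.tsv \
    --denoised-copy-ratios sandbox/hcc1143_T_clean.matched.denoisedCR.tsv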

4F. Tumor case standardized against the normal matched-control
T_normalonly.png

Compare these results to those of section 4.1. Notice the depression in chr2 copy ratios that occurs due to the PoN normal sample's chr2 trisomy. Here, the MAD of 0.149 is an incremental improvement on section 4.1's PoN-M denoising (MAD=0.15). In contrast, PoN-C denoising (MAD=0.125) and even PoN-C standardization alone (MAD=0.134) are seemingly better normalization approaches than the matched-normal standardization. Again, the results stress the importance of selective PoN creation.

The PoN accounts for germline CNVs common to its constituents such that the workflow discounts the same variation in the case. It is possible for the workflow to detect germline CNVs not represented in the PoN, in particular, rare germline CNVs. In the case of matched-normal standardization, the workflow should discount germline CNVs and reveal only somatic events.

The workflow does not support iteratively denoising two samples each against a PoN and then against each other.

The tutorial continues in a second document at #11683.

back to top


Footnotes


[1] The constituents of the forty-sample CNV panel of normals differ from those of the Mutect2 panel of normals. Preliminary CNV data was generated with v4.0.1.1 somatic CNV WDL scripts run locally on a Google Cloud Compute Engine VM with Cromwell v30.2. Additional refinements were performed on a 16GB MacBook Pro laptop. Additional plots were generated using a broadinstitute/gatk:4.0.1.1 Docker container. Note the v4.0.1.1 WDL script does not allow custom sequence dictionaries for the plotting steps.


[2] Considerations in genomic intervals are as follows.

  • For targeted exomes, the intervals should represent the bait capture or target capture regions.
  • For whole genomes, either supply regions where coverage is expected across samples, e.g. regions that exclude alternate haplotypes and decoy regions in GRCh38, or omit the option for references where coverage is expected across the entirety of the reference.
  • For either type of data, expect to modify the intervals depending on (i) extent of masking in the reference used in read mapping and (ii) expectations in coverage on allosomal contigs. For example, for mammalian data, expect to remove Y chromosome intervals for female samples.


[3] See original discussion on bin size here. The bin size determines the resolution of CNV breakpoints. The theoretical limit depends on coverage depth and the insert-size distribution. Typically bin sizes on the order of the read length will give reasonable results. The GATK developers have tested WGS runs where the bin size is as small as 250 bases.


[4] Set --interval-merging-rule to OVERLAPPING_ONLY, to prevent the tool from merging abutting intervals. The default is set to ALL for GATK4.0.1.1. For future versions, the default will be set to OVERLAPPING_ONLY.


[5] The tool allows specifying both the padding and the binning arguments simultaneously. If exome targets are very long, it may be preferable to both pad and break up the intervals with binning. This may provide some additional resolution.


[6] The data bundle from Tutorial#11136 contains tumor.bam and normal.bam. These tumor and normal samples are identical to those in the current tutorial and represent a subset of the full data for the following regions:

chr6    29941013    29946495    +    
chr11   915890  1133890 +    
chr17   1   83257441    +    
chr11_KI270927v1_alt    1   218612  +    
HLA-A*24:03:01  1   3502    +


[7] The following notes regarding read filters may be of interest and apply to the workflow illustrated in the tutorial, which uses CollectFragmentCounts.

  • In contrast to prior versions of the workflow, the GATK4 CNV workflow excludes duplicate fragments from consideration with the NotDuplicateReadFilter. To instead include duplicate fragments, specify -DF NotDuplicateReadFilter.
  • The tool only considers paired-end reads (0x1 SAM flag) and the first of pair (0x40 flag) with the FirstOfPairReadFilter. The tool uses the first-of-pair read’s mapping information for the fragment center.
  • The tool only considers properly paired reads (0x2 SAM flag) using the ProperlyPairedReadFilter. Depending on whether and how data was preprocessed with MergeBamAlignment, proper pair assignments can differ from that given by the aligner. This filter also removes single ended reads.
  • The MappingQualityReadFilter sets a threshold for alignment MAPQ. The tool sets --minimum-mapping-quality to 30. Thus, the tool uses reads with MAPQ30 or higher.


[8] The current tool version requires strategizing denoising of allosomal chromosomes, e.g. X and Y in humans, against the panel of normals. This is because coverage will vary for these regions depending on the sex of the sample. To determine the sex of samples, analyze them with DetermineGermlineContigPloidy. Aneuploidy in allosomal chromosomes, much like trisomy, can still make for viable organisms and so phenotypic sex designations are insufficient. GermlineCNVCaller can account for differential sex in data.

Many thanks to Samuel Lee (@slee), no relation, for his patient explanations of the workflow parameters and mathematics behind the workflow. Researchers may find reading through @slee's comments on the forum, a few of which the tutorials link to, insightful.

back to top



Build the SNP recalibration model error


Hi,

I am trying to build the SNP recalibration model by running the following GATK command:

./gatk-4.0.3.0/gatk VariantRecalibrator \
-R human_g1k_v37_decoy.fasta \
-input /mergedFiles.vcf \
--resource hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.sites.vcf \
--resource omni,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.b37.sites.vcf \
--resource 1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.vcf \
--resource dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_135.b37.vcf \
-an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an InbreedingCoeff \
-mode SNP \
-tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \
--recalFile recalibrate_SNP.recal \
-tranchesFile output.tranches \
--rscriptFile output.plots.R

But I am getting the following error.

Error:


A USER ERROR has occurred: Invalid argument 'hapmap_3.3.b37.sites.vcf'.


Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.

I used human_g1k_v37_decoy.fasta for alignment and am therefore using the same reference for recalibration. I would like to turn the raw variants into analysis-ready variants by applying filtration and annotation. Please let me know if you have any direction on the best practice approach.
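
Could the problem be the tagged-argument syntax? From the 4.0 documentation it looks like each resource file should be attached after a colon rather than a space, something like the line below, but I am not certain:

--resource hapmap,known=false,training=true,truth=true,prior=15.0:hapmap_3.3.b37.sites.vcf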

Thanks

What does it mean when an ALT-homozygous site contains MQRankSum and ReadPosRankSum?


Hello GATK,

What does it mean when an ALT-homozygous site contains MQRankSum and ReadPosRankSum?
From https://software.broadinstitute.org/gatk/documentation/article.php?id=2806, I know that only heterozygous sites should contain MQRankSum and ReadPosRankSum, parameters that reflect the quality of ALT sites. This seems unreasonable to me.

I'm analyzing a babbler population genome and got some variants where this situation happens. I will run VariantFiltration next, and before filtering I want to clarify what these parameters mean.

I've taken three example records from my VCF file, shown below.

scaffold20_len495909_cov26 73630 . T C 206.78 . AC=2;AF=1.00;AN=2;DP=9;Dels=0.00;ExcessHet=3.0103;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=48.24;MQ0=0;QD=22.98;SOR=0.892 GT:AD:DP:GQ:PL 1/1:0,9:9:24:235,24,0
scaffold20_len495909_cov26 74196 . C T 205.89 . AC=2;AF=1.00;AN=2;BaseQRankSum=-0.282;DP=18;Dels=0.00;ExcessHet=3.0103;FS=9.542;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=27.08;MQ0=9;MQRankSum=1.593;QD=11.44;ReadPosRankSum=1.593;SOR=4.398 GT:AD:DP:GQ:PL 1/1:9,9:18:2:232,2,0
scaffold20_len495909_cov26 74256 . C T 66.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=2.823;DP=12;Dels=0.00;ExcessHet=3.0103;FS=0.000;HaplotypeScore=0.0000;MLEAC=1;MLEAF=0.500;MQ=41.63;MQ0=2;MQRankSum=1.078;QD=5.56;ReadPosRankSum=0.930;SOR=0.892 GT:AD:DP:GQ:PL 0/1:8,4:12:95:95,0,132

Does this mean my variant calling is wrong?

Identifying contaminant reads in MergeBamAlignment


I noticed that several of my WGS runs in GATK4's MergeBamAlignment step had high numbers of contaminant reads. A snippet of one of the stderr logs is shown below:

INFO    2018-07-02 02:12:29 AbstractAlignmentMerger 86389692 Reads have been unmapped due to being suspected of being Cross-species contamination.
INFO    2018-07-02 02:12:36 AbstractAlignmentMerger Wrote 689846082 alignment records and 119640074 unmapped reads.
[Mon Jul 02 02:12:36 UTC 2018] picard.sam.MergeBamAlignment done. Elapsed time: 434.47 minutes.
Runtime.totalMemory()=3040870400
Using GATK jar /gatk/build/libs/gatk-package-4.0.4.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Dsamjdk.compression_level=5 -Xms3000m -jar /gatk/build/libs/gatk-package-4.0.4.0-local.jar MergeBamAlignment --VALIDATION_STRINGENCY SILENT --EXPECTED_ORIENTATIONS FR --ATTRIBUTES_TO_RETAIN X0 --ALIGNED_BAM /cromwell_root/fc-4449151c-8501-4474-a203-83d0c4dbd051/6d545972-e762-4770-9e8c-7703f11301dc/PreProcessingForVariantDiscovery_GATK4/cc708444-df39-4766-aec7-92536ad81ea7/call-SamToFastqAndBwaMem/shard-0/916_3063_1_3_h5nlwalxx_3.unmapped.unmerged.bam --UNMAPPED_BAM /cromwell_root/fc-4449151c-8501-4474-a203-83d0c4dbd051/uBAMs/916_3063_1_3/916_3063_1_3_h5nlwalxx_3.unmapped.bam --OUTPUT 916_3063_1_3_h5nlwalxx_3.unmapped.aligned.unsorted.bam --REFERENCE_SEQUENCE /cromwell_root/broad-references/hg38/v0/Homo_sapiens_assembly38.fasta --PAIRED_RUN true --SORT_ORDER unsorted --IS_BISULFITE_SEQUENCE false --ALIGNED_READS_ONLY false --CLIP_ADAPTERS false --MAX_RECORDS_IN_RAM 2000000 --ADD_MATE_CIGAR true --MAX_INSERTIONS_OR_DELETIONS -1 --PRIMARY_ALIGNMENT_STRATEGY MostDistant --PROGRAM_RECORD_ID bwamem --PROGRAM_GROUP_VERSION 0.7.15-r1140 --PROGRAM_GROUP_COMMAND_LINE bwa mem -K 100000000 -p -v 3 -t 16 -Y /cromwell_root/broad-references/hg38/v0/Homo_sapiens_assembly38.fasta --PROGRAM_GROUP_NAME bwamem --UNMAPPED_READ_STRATEGY COPY_TO_TAG --ALIGNER_PROPER_PAIR_FLAGS true --UNMAP_CONTAMINANT_READS true

I would like to be able to extract these contaminant reads and BLAST them to investigate where they came from. However, I didn't see a way to pull out these contaminant reads using command line options, so I tried parsing the bam files.

To extract unmapped reads, I used samtools to filter for the '0x4' flag in the analysis-ready BAM file. The reads I pulled out from this process seemed to be reads with 0 mapping quality, and not necessarily contaminant reads. After some more digging, I found out that reads are marked for contamination when they are 1) soft clipped on both ends and 2) align with fewer than 32 bases. As a result, I wrote a simple "contaminant-finder" that parses my analysis-ready BAM file, and here's an example of a read I personally marked as contaminant:

E00489:73:H7VLWALXX:2:1121:2290:14635   83  chr1    10166   6   95S11M44S   =   10159   -18 ACCTAAACCTAACCCTAGCCCTAAACCTAACCCTAACTCTAACCCTATCCTAACCCTACCCCTAACCTAACCCTACCCCCCAACCCTAACCCTAAACCTAACCCTAGCCAAAAGCCTAAACCTAACCCTAGCCCTAGCCCTAGCCAAAAC  5?????5?????????????????????5????????????5??????????5??5?5+??????????5?5???+???????????5???????5????????55???5??5??????5?????????????????????????????5  MC:Z:18M132S    MD:Z:0C10   PG:Z:MarkDuplicates RG:Z:H7VLWALXX.1    NM:i:1  MQ:i:10 UQ:i:12 AS:i:42

According to the CIGAR string "95S11M44S", this read clearly matches the criteria for contamination. Strangely, the flag "83" does not have the '0x4' bit set, which means the read is not unmapped according to the SAM format specification.
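
For reference, here is roughly what my contaminant-finder does, as a samtools/awk sketch of the two criteria above (soft clips on both ends, fewer than 32 aligned bases); it approximates rather than reproduces MergeBamAlignment's internal logic:

samtools view analysis_ready.bam | awk '
    # keep reads whose CIGAR both starts and ends with a soft clip
    $6 ~ /^[0-9]+S/ && $6 ~ /S$/ {
        m = 0; cigar = $6
        # sum the aligned (M) bases in the CIGAR string
        while (match(cigar, /[0-9]+M/)) {
            m += substr(cigar, RSTART, RLENGTH - 1) + 0
            cigar = substr(cigar, RSTART + RLENGTH)
        }
        if (m < 32) print $1, $3, $4, $6
    }'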

In summary, here are my questions:
1. How can I extract reads marked as contaminant from my analysis-ready BAM file (produced after alignment, duplicate marking, and BQSR)?
2. How exactly does MergeBamAlignment "mark" reads as contaminant without raising the '0x4' bit?

Thank you for your time!
Lee

ASEReadCounter outputs only header without error


Hi,

I have a problem similar to this:
HaplotypeCaller output header and one position recode without error

but with GATK4 ASEReadCounter. I have a VCF file created with samtools mpileup, compressed with bgzip, with duplicate variants removed by bcftools norm, and indexed with tabix. The RNA-seq BAM file was aligned to hg38 with STAR and sorted and indexed with samtools. I also ran AddOrReplaceReadGroups and ValidateSamFile on the BAM file.

Command:

module load gatk-env/4.0.2.1
gatk ASEReadCounter \
    --input sample_readgroups.bam \
    --variant sample.vcf.gz \
    --disable-tool-default-read-filters true \
    --reference hg38.fa \
    --output sample.ASEReadCounter.rtable \
    --java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true'

What happens:

----------------------------------------------------------------------------
  Setting up new environment, removing all currently loaded modules
----------------------------------------------------------------------------
  Loading new modules:
    r-env/3.2.5    rstudio    java/oracle/1.8    picard/2.13.2
----------------------------------------------------------------------------
  Setting up new environment, removing all currently loaded modules
----------------------------------------------------------------------------
  Loading new modules:
    gcc/4.9.3    intelmpi/5.1.1    mkl/11.3.0    r-app/3.2.5
r-app R-3.2.5 environment loaded

rstudio v0.99.1196 environment loaded

Start RStudio with rstudio

For interactive work consider using taito-shell.csc.fi

Loading application Oracle Java 1.8 
picard 2.13.2 environment loaded

Due to MODULEPATH changes the following have been reloaded:
  1) intelmpi/5.1.1  2) mkl/11.3.0

10:15:09.824 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/homeappl/appl_taito/bio/GATK4/gatk-4.0.2.1/gatk-package-4.0.2.1-local.jar!/com/intel/gkl/native/libgkl_compression.so
10:15:10.259 INFO  ASEReadCounter - ------------------------------------------------------------
10:15:10.260 INFO  ASEReadCounter - The Genome Analysis Toolkit (GATK) v4.0.2.1
10:15:10.260 INFO  ASEReadCounter - For support and documentation go to https://software.broadinstitute.org/gatk/
10:15:10.261 INFO  ASEReadCounter - Executing as randelin@c981 on Linux v2.6.32-642.15.1.el6.x86_64 amd64
10:15:10.261 INFO  ASEReadCounter - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_151-b12
10:15:10.261 INFO  ASEReadCounter - Start Date/Time: July 5, 2018 10:15:09 AM GMT
10:15:10.261 INFO  ASEReadCounter - ------------------------------------------------------------
10:15:10.262 INFO  ASEReadCounter - ------------------------------------------------------------
10:15:10.264 INFO  ASEReadCounter - HTSJDK Version: 2.14.3
10:15:10.264 INFO  ASEReadCounter - Picard Version: 2.17.2
10:15:10.264 INFO  ASEReadCounter - HTSJDK Defaults.COMPRESSION_LEVEL : 1
10:15:10.264 INFO  ASEReadCounter - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
10:15:10.265 INFO  ASEReadCounter - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
10:15:10.265 INFO  ASEReadCounter - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
10:15:10.265 INFO  ASEReadCounter - Deflater: IntelDeflater
10:15:10.265 INFO  ASEReadCounter - Inflater: IntelInflater
10:15:10.265 INFO  ASEReadCounter - GCS max retries/reopens: 20
10:15:10.265 INFO  ASEReadCounter - Using google-cloud-java patch 6d11bef1c81f885c26b2b56c8616b7a705171e4f from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
10:15:10.265 INFO  ASEReadCounter - Initializing engine
10:15:11.286 INFO  FeatureManager - Using codec VCFCodec to read file file:///sample.vcf.gz
10:15:11.442 INFO  ASEReadCounter - Done initializing engine
10:15:11.448 INFO  ProgressMeter - Starting traversal
10:15:11.448 INFO  ProgressMeter -        Current Locus  Elapsed Minutes        Loci Processed      Loci/Minute
10:15:21.460 INFO  ProgressMeter -          chr1:368742              0.2                117000         701228.6
10:15:31.997 INFO  ProgressMeter -          chr1:629976              0.3                213000         621958.3
10:15:42.018 INFO  ProgressMeter -          chr1:698491              0.5                236000         463199.2
10:15:44.969 WARN  ASEReadCounter - Ignoring site: variant is not het at postion: chr1:1302087
10:15:45.124 WARN  ASEReadCounter - Ignoring site: cannot run ASE on non-biallelic sites: [VC Unknown @ chr1:1319980 Q0.00 of type=MIXED alleles=[G*, <*>, A] attr={BQB=1, DP=16, I16=[1, 4, 2, 6, 417, 38245, 629, 53651, 100, 2000, 160, 3200, 110, 2600, 191, 4603], MQ0F=0, MQB=1, MQSB=1, QS=[0.384615, 0.615385, 0], RPB=0.908075, SGB=-0.651104, VDB=0.00836812} GT=PL   85,0,48,100,72,141 filters=
10:15:45.127 WARN  ASEReadCounter - Ignoring site: cannot run ASE on non-biallelic sites: [VC Unknown @ chr1:1320130 Q0.00 of type=MIXED alleles=[G*, <*>, T] attr={BQB=0.477935, DP=17, I16=[3, 4, 5, 3, 470, 31682, 494, 31808, 140, 2800, 160, 3200, 110, 2306, 125, 2683], MQ0F=0, MQB=1, MQSB=1, QS=[0.466667, 0.533333, 0], RPB=0.921243, SGB=-0.651104, VDB=0.890472} GT=PL  83,0,74,104,98,164 filters=
10:15:52.045 INFO  ProgressMeter -         chr1:1566872              0.7                758000        1120279.8
10:15:56.002 WARN  ASEReadCounter - Ignoring site: cannot run ASE on non-biallelic sites: [VC Unknown @ chr1:1796907 Q0.00 of type=MIXED alleles=[C*, <*>, T] attr={BQB=0.5, DP=6, I16=[3, 1, 0, 2, 277, 19213, 124, 7738, 80, 1600, 40, 800, 51, 695, 28, 392], MQ0F=0, MQB=1, MQSB=0.861511, QS=[0.666667, 0.333333, 0], RPB=0.75, SGB=-0.453602, VDB=0.02} GT=PL 19,0,53,31,59,78 filters=
.
.
.
(continues the same way)
.
.
.
10:56:25.174 WARN  ASEReadCounter - Ignoring site: cannot run ASE on non-biallelic sites: [VC Unknown @ chr22:50445452 Q0.00 of type=MIXED alleles=[C*, <*>, T] attr={BQB=0.991201, DP=53, I16=[7, 19, 6, 14, 1786, 128860, 1271, 83753, 520, 10400, 400, 8000, 501, 11051, 386, 8524], MQ0F=0, MQB=1, MQSB=1, QS=[0.565217, 0.434783, 0], RPB=0.95936, SGB=-0.692067, VDB=0.171326} GT=PL  115,0,137,193,197,255 filters=
10:56:25.317 INFO  ProgressMeter -       chr22:50466875             41.2             375434000        9105591.3
10:56:37.922 INFO  ProgressMeter -    KI270733.1:177566             41.4             375720000        9066332.5
10:56:47.925 INFO  ProgressMeter -        chrX:17782656             41.6             377480000        9072304.7
10:56:58.022 INFO  ProgressMeter -        chrX:48266117             41.8             379323000        9079875.6
10:57:08.024 INFO  ProgressMeter -        chrX:56859067             41.9             380706000        9076761.4
10:57:18.025 INFO  ProgressMeter -        chrX:75064330             42.1             382619000        9086261.8
10:57:28.050 INFO  ProgressMeter -       chrX:104532016             42.3             384291000        9089900.6
10:57:38.056 INFO  ProgressMeter -       chrX:133362446             42.4             386383000        9103474.1
10:57:48.187 INFO  ProgressMeter -       chrX:154401268             42.6             388123000        9108235.1
10:57:58.188 INFO  ProgressMeter -         chrY:1392302             42.8             388731000        9086958.6
10:58:08.784 INFO  ProgressMeter -    GL000220.1:115358             43.0             390465000        9089967.3
10:58:18.815 INFO  ProgressMeter -    KI270744.1:125108             43.1             390951000        9065996.4
10:58:28.827 INFO  ProgressMeter -    GL339449.2:846918             43.3             393140000        9081616.5
10:58:38.832 INFO  ProgressMeter -     KI270832.1:10337             43.5             395213000        9094489.1
10:58:49.419 INFO  ProgressMeter -    KI270853.1:403868             43.6             397072000        9100299.4
10:58:59.456 INFO  ProgressMeter -    KI270853.1:665159             43.8             397231000        9069173.3
10:59:09.465 INFO  ProgressMeter -    KI270853.1:942620             44.0             397407000        9038766.6
10:59:19.467 INFO  ProgressMeter -   KI270857.1:1171038             44.1             398301000        9024882.4
10:59:29.480 INFO  ProgressMeter -    KI270875.1:189541             44.3             400159000        9032866.6
10:59:39.725 INFO  ProgressMeter -    KI270897.1:311024             44.5             400922000        9015300.9
10:59:54.064 INFO  ProgressMeter -    KI270897.1:449611             44.7             400979000        8968387.6
11:00:04.076 INFO  ProgressMeter -   GL000251.2:3109525             44.9             401940000        8956457.8
11:00:14.089 INFO  ProgressMeter -   KI270908.1:1250971             45.0             403993000        8968849.4
11:00:24.597 INFO  ProgressMeter -   GL000252.2:4519995             45.2             405265000        8962242.8
11:00:34.598 INFO  ProgressMeter -   GL949749.2:1057485             45.4             406585000        8958412.1
11:00:44.600 INFO  ProgressMeter -     GL000255.2:23082             45.6             407776000        8951774.4
11:00:54.620 INFO  ProgressMeter -    GL949751.2:181561             45.7             408847000        8942501.6
11:01:04.727 INFO  ProgressMeter -     KI270938.1:81508             45.9             410141000        8937873.7
11:01:26.276 INFO  ProgressMeter -            chrM:2078             46.2             410300000        8871901.2
11:01:53.093 INFO  ProgressMeter -            chrM:3078             46.7             410301000        8787001.9
11:02:07.769 INFO  ProgressMeter -            chrM:4078             46.9             410302000        8741233.7
11:02:19.157 INFO  ProgressMeter -            chrM:5078             47.1             410303000        8706051.4
11:02:32.930 INFO  ProgressMeter -            chrM:6078             47.4             410304000        8663873.3
11:03:00.646 INFO  ProgressMeter -            chrM:7078             47.8             410305000        8580202.6
11:03:43.980 INFO  ProgressMeter -            chrM:8078             48.5             410306000        8452562.9
11:04:15.210 INFO  ProgressMeter -            chrM:9078             49.1             410307000        8362911.1
11:04:41.972 INFO  ProgressMeter -           chrM:10078             49.5             410308000        8287588.3
11:04:57.510 INFO  ProgressMeter -           chrM:11078             49.8             410309000        8244483.9
11:05:26.815 INFO  ProgressMeter -           chrM:12078             50.3             410310000        8164379.3
11:05:38.066 INFO  ProgressMeter -           chrM:14078             50.4             410312000        8134069.1
11:05:48.378 INFO  ProgressMeter -           chrM:15078             50.6             410313000        8106469.4
11:05:58.376 INFO  ASEReadCounter - No reads filtered by: AllowAllReadsReadFilter
11:05:58.376 INFO  ProgressMeter -           chrM:16078             50.8             410314491        8079898.7
11:05:58.376 INFO  ProgressMeter - Traversal complete. Processed 410314491 total loci in 50.8 minutes.
11:05:58.376 INFO  ASEReadCounter - Shutting down engine
[July 5, 2018 11:05:58 AM GMT] org.broadinstitute.hellbender.tools.walkers.rnaseq.ASEReadCounter done. Elapsed time: 50.81 minutes.
Runtime.totalMemory()=2045640704
Using GATK jar /homeappl/appl_taito/bio/GATK4/gatk-4.0.2.1/gatk-package-4.0.2.1-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 -DGATK_STACKTRACE_ON_USER_EXCEPTION=true -jar /homeappl/appl_taito/bio/GATK4/gatk-4.0.2.1/gatk-package-4.0.2.1-local.jar ASEReadCounter --input sample_readgroups.bam --variant sample.vcf.gz --disable-tool-default-read-filters true --reference hg38.fa --output sample.ASEReadCounter.rtable

The output rtable contains only the header:

contig  position    variantID   refAllele   altAllele   refCount    altCount    totalCount  lowMAPQDepth    lowBaseQDepth   rawDepth    otherBases  improperPairs


What should I do?

The same thing also happens when I enable the default read filters.
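
Would it help to first call genotypes and restrict the VCF to biallelic SNPs? The bcftools sketch below is what I have in mind, though I have not confirmed it is the right fix:

# call genotypes so records carry a GT field, then keep only biallelic SNPs
bcftools call -m -O z -o sample.called.vcf.gz sample.vcf.gz
bcftools view -m2 -M2 -v snps -O z -o sample.biallelic.vcf.gz sample.called.vcf.gz
tabix -p vcf sample.biallelic.vcf.gz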

Here is my VCF:

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##samtoolsVersion=1.3.1+htslib-1.3.1
##samtoolsCommand=samtools mpileup -v -u -f hg38.fa -l sample.bed -o sample.vcf sample_sorted.bam
##reference=file:///hg38.fa
##contig=<ID=chr1,length=248956422>
##contig=<ID=KI270706.1,length=175055>
##contig=<ID=KI270707.1,length=32032>
##contig=<ID=KI270708.1,length=127682>
##contig=<ID=KI270709.1,length=66860>
##contig=<ID=KI270710.1,length=40176>
##contig=<ID=KI270711.1,length=42210>
##contig=<ID=KI270712.1,length=176043>
##contig=<ID=KI270713.1,length=40745>
##contig=<ID=KI270714.1,length=41717>
##contig=<ID=chr2,length=242193529>
##contig=<ID=KI270715.1,length=161471>
##contig=<ID=KI270716.1,length=153799>
##contig=<ID=chr3,length=198295559>
##contig=<ID=GL000221.1,length=155397>
##contig=<ID=chr4,length=190214555>
##contig=<ID=GL000008.2,length=209709>
##contig=<ID=chr5,length=181538259>
##contig=<ID=GL000208.1,length=92689>

.
.
.
(continues the same way)
.
.
.
##contig=<ID=KI270930.1,length=200773>
##contig=<ID=KI270931.1,length=170148>
##contig=<ID=KI270932.1,length=215732>
##contig=<ID=KI270933.1,length=170537>
##contig=<ID=GL000209.2,length=177381>
##contig=<ID=chrM,length=16569>
##ALT=<ID=*,Description="Represents allele(s) other than observed.">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##INFO=<ID=IDV,Number=1,Type=Integer,Description="Maximum number of reads supporting an indel">
##INFO=<ID=IMF,Number=1,Type=Float,Description="Maximum fraction of reads supporting an indel">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant Distance Bias for filtering splice-site artefacts in RNA-seq data (bigger is better)",Version="3">
##INFO=<ID=RPB,Number=1,Type=Float,Description="Mann-Whitney U test of Read Position Bias (bigger is better)">
##INFO=<ID=MQB,Number=1,Type=Float,Description="Mann-Whitney U test of Mapping Quality Bias (bigger is better)">
##INFO=<ID=BQB,Number=1,Type=Float,Description="Mann-Whitney U test of Base Quality Bias (bigger is better)">
##INFO=<ID=MQSB,Number=1,Type=Float,Description="Mann-Whitney U test of Mapping Quality vs Strand Bias (bigger is better)">
##INFO=<ID=SGB,Number=1,Type=Float,Description="Segregation based metric.">
##INFO=<ID=MQ0F,Number=1,Type=Float,Description="Fraction of MQ0 reads (smaller is better)">
##INFO=<ID=I16,Number=16,Type=Float,Description="Auxiliary tag used for calling, see description of bcf_callret1_t in bam2bcf.h">
##INFO=<ID=QS,Number=R,Type=Float,Description="Auxiliary tag used for calling">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
##bcftools_normVersion=1.4+htslib-1.4
##bcftools_normCommand=norm -d any -O z -o sample.vcf.gz sample.vcf.gz; Date=Thu Jul  5 10:33:29 2018
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  sample_sorted.bam
chr1    1302087 .       G       <*>     0       .       DP=3;I16=0,2,0,0,139,9661,0,0,40,800,0,0,49,1201,0,0;QS=1,0;MQ0F=0      PL      0,6,36
chr1    1319980 .       G       A,<*>   0       .       DP=16;I16=1,4,2,6,417,38245,629,53651,100,2000,160,3200,110,2600,191,4603;QS=0.384615,0.615385,0;VDB=0.00836812;SGB=-0.651104;RPB=0.908075;MQB=1;MQSB=1;BQB=1;MQ0F=0    PL      85,0,48,100,72,141
chr1    1320130 .       G       T,<*>   0       .       DP=17;I16=3,4,5,3,470,31682,494,31808,140,2800,160,3200,110,2306,125,2683;QS=0.466667,0.533333,0;VDB=0.890472;SGB=-0.651104;RPB=0.921243;MQB=1;MQSB=1;BQB=0.477935;MQ0F=0       PL      83,0,74,104,98,164
.
.
.
(etc.)

I am out of ideas. Thank you so much for your time.

i_variant_quality_by_depth/i_genotype_quality interpretation


When interpreting the output of HaplotypeCaller, what do the i_variant_quality_by_depth and i_genotype_quality columns represent, and which of these would be a good value on which to base an assessment of confidence in the variant call and quality? What scale are they on? Or is there a different column that would be better?
