
sorting a BAM file with PICARD


Dear all,

Would you please advise: I am using Picard to sort a BAM file by read name (it is a BAM file from EGA that contains cancer sequencing data). When I run Picard SortSam, I get the following error (below), and the file does not get sorted. Is there a way I could fix it? Thank you very much!

Exception in thread "main" htsjdk.samtools.SAMFormatException: SAM validation error: ERROR: Record 957453876, Read name HWI-ST7001002R:223:C14GPACXX:3:1305:7471:56486, MAPQ should be 0 for unmapped read
The Picard command is:

java -jar $PICARD SortSam \
    I=$FILE \
    O="${FILE}.sorted.picard.queryname.bam"


four issues with GATK HaplotypeCaller


Hi,

I have the following questions about HaplotypeCaller. Could someone please help me out here?

  1. Some of the other GATK tools, such as IndelRealigner and BaseRecalibrator, use the "-known" (or "--knownSites") option for known VCF files. However, it seems that HaplotypeCaller does not support this; instead, it takes the "--dbsnp" option. Can someone please confirm this and explain why this is the case? Can't I supply 1000 Genomes data as a known-sites resource for GATK HaplotypeCaller?

  2. I found that I could not use .gz files for the "--dbsnp" option or for the "-R" option. As you can see, the unzipped dbSNP file is huge. Can GATK really only take this huge plain-text file? I hope this can be improved.

  3. To use GATK v4 to process WGS data, can I simply use "gatk HaplotypeCaller -R -I -O -ERC --dbsnp", or do I need to specify other parameters such as "--genotyping_mode DISCOVERY -stand_call_conf 30 -stand_emit_conf 10 -minPruning 3"? I found that some options are named differently between GATK v3 and v4, for example "-o" vs. "-O", which can be confusing.

  4. I downloaded the bundle files from https://software.broadinstitute.org/gatk/download/bundle and ran the following command, but got the error pasted below:
    gatk HaplotypeCaller -R ucsc.hg19.fasta -I jie.bam -O jie.gvcf -ERC GVCF --dbsnp dbsnp_138.hg19.vcf

A USER ERROR has occurred: Input files reference and reads have incompatible contigs: Found contigs with the same name but different lengths:
contig reference = chrM / 16571
contig reads = chrM / 16569.
reference contigs = [chrM, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chr1_gl000191_random, chr1_gl000192_random, chr4_ctg9_hap1, chr4_gl000193_random, chr4_gl000194_random, chr6_apd_hap1, chr6_cox_hap2, chr6_dbb_hap3, chr6_mann_hap4, chr6_mcf_hap5, chr6_qbl_hap6, chr6_ssto_hap7, chr7_gl000195_random, chr8_gl000196_random, chr8_gl000197_random, chr9_gl000198_random, chr9_gl000199_random, chr9_gl000200_random, chr9_gl000201_random, chr11_gl000202_random, chr17_ctg5_hap1, chr17_gl000203_random, chr17_gl000204_random, chr17_gl000205_random, chr17_gl000206_random, chr18_gl000207_random, chr19_gl000208_random, chr19_gl000209_random, chr21_gl000210_random, chrUn_gl000211, chrUn_gl000212, chrUn_gl000213, chrUn_gl000214, chrUn_gl000215, chrUn_gl000216, chrUn_gl000217, chrUn_gl000218, chrUn_gl000219, chrUn_gl000220, chrUn_gl000221, chrUn_gl000222, chrUn_gl000223, chrUn_gl000224, chrUn_gl000225, chrUn_gl000226, chrUn_gl000227, chrUn_gl000228, chrUn_gl000229, chrUn_gl000230, chrUn_gl000231, chrUn_gl000232, chrUn_gl000233, chrUn_gl000234, chrUn_gl000235, chrUn_gl000236, chrUn_gl000237, chrUn_gl000238, chrUn_gl000239, chrUn_gl000240, chrUn_gl000241, chrUn_gl000242, chrUn_gl000243, chrUn_gl000244, chrUn_gl000245, chrUn_gl000246, chrUn_gl000247, chrUn_gl000248, chrUn_gl000249]
reads contigs = [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chrM]

Thank you very much & best regards,
Jie
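
Regarding the error in question 4: it indicates that the BAM was aligned to a reference whose chrM is the 16,569-bp rCRS sequence, while ucsc.hg19.fasta carries the older 16,571-bp chrM. HaplotypeCaller requires the exact same reference that was used for alignment. A quick way to see which contigs a BAM expects (a sketch; the file name is from the command above):

samtools view -H jie.bam | grep '^@SQ'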

Which training sets / arguments should I use for running VQSR?


This document describes the resource datasets and arguments that we recommend for use in the two steps of VQSR (i.e. the successive application of VariantRecalibrator and ApplyRecalibration), based on our work with human genomes, to comply with the GATK Best Practices. The recommendations detailed in this document take precedence over any others you may see elsewhere in our documentation (e.g. in Tutorial articles, which are only meant to illustrate usage, or in past presentations, which may be out of date).

The document covers:

  • Explanation of resource datasets
  • Important notes about annotations
  • Important notes about exome experiments
  • Argument recommendations for VariantRecalibrator
  • Argument recommendations for ApplyRecalibration

These recommendations are valid for use with calls generated by both the UnifiedGenotyper and HaplotypeCaller. In the past we made a distinction in how we processed the calls from these two callers, but now we treat them the same way. These recommendations will probably not work properly on calls generated by other (non-GATK) callers.

Note that VQSR must be run twice in succession in order to build a separate error model for SNPs and INDELs (see the VQSR documentation for more details).


Explanation of resource datasets

The human genome training, truth and known resource datasets mentioned in this document are all available from our resource bundle.

If you are working with non-human genomes, you will need to find or generate at least truth and training resource datasets with properties corresponding to those described below. To generate your own resource set, one idea is to first do an initial round of SNP calling and only use those SNPs which have the highest quality scores. These sites which have the most confidence are probably real and could be used as truth data to help disambiguate the rest of the variants in the call set. Another idea is to try using several SNP callers in addition to the UnifiedGenotyper or HaplotypeCaller, and use those sites which are concordant between the different methods as truth data. In either case, you'll need to assign your set a prior likelihood that reflects your confidence in how reliable it is as a truth set. We recommend Q10 as a starting value, which you can then experiment with to find the most appropriate value empirically. There are many possible avenues of research here. Hopefully the model reporting plots that are generated by the recalibration tools will help facilitate this experimentation.

Resources for SNPs

  • True sites training resource: HapMap

    This resource is a SNP call set that has been validated to a very high degree of confidence. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). We will also use these sites later on to choose a threshold for filtering variants based on sensitivity to truth sites. The prior likelihood we assign to these variants is Q15 (96.84%).

  • True sites training resource: Omni

    This resource is a set of polymorphic SNP sites produced by the Omni genotyping array. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q12 (93.69%).

  • Non-true sites training resource: 1000G
    This resource is a set of high-confidence SNP sites produced by the 1000 Genomes Project. The program will consider that the variants in this resource may contain true variants as well as false positives (truth=false), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q10 (90%).

  • Known sites resource, not used in training: dbSNP
    This resource is a call set that has not been validated to a high degree of confidence (truth=false). The program will not use the variants in this resource to train the recalibration model (training=false). However, the program will use these to stratify output metrics such as Ti/Tv ratio by whether variants are present in dbSNP or not (known=true). The prior likelihood we assign to these variants is Q2 (36.90%).

Resources for Indels

  • True sites training resource: Mills
    This resource is an Indel call set that has been validated to a high degree of confidence. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q12 (93.69%).

  • Known sites resource, not used in training: dbSNP
    This resource is a call set that has not been validated to a high degree of confidence (truth=false). The program will not use the variants in this resource to train the recalibration model (training=false). However, the program will use these to stratify output metrics such as Ti/Tv ratio by whether variants are present in dbSNP or not (known=true). The prior likelihood we assign to these variants is Q2 (36.90%).


Important notes about annotations

Some of the annotations included in the recommendations given below might not be the best for your particular dataset. In particular, the following caveats apply:

  • Depth of coverage (the DP annotation invoked by Coverage) should not be used when working with exome datasets since there is extreme variation in the depth to which targets are captured! In whole genome experiments this variation is indicative of error but that is not the case in capture experiments.

  • You may have seen HaplotypeScore mentioned in older documents. That is a statistic produced by the UnifiedGenotyper and should only be used if you called your variants with UG. This statistic isn't produced by the HaplotypeCaller because the math is already built into the likelihood function itself when calling full haplotypes with HC.

  • The InbreedingCoeff is a population-level statistic that requires at least 10 samples in order to be computed. For projects with fewer samples, or projects that include many closely related samples (such as a family), please omit this annotation from the command line.


Important notes for exome capture experiments

In our testing we've found that in order to achieve the best exome results one needs to use an exome SNP and/or indel callset with at least 30 samples. For users with experiments containing fewer exome samples there are several options to explore:

  • Add additional samples for variant calling, either by sequencing additional samples or using publicly available exome bams from the 1000 Genomes Project (this option is used by the Broad exome production pipeline). Be aware that you cannot simply add VCFs from the 1000 Genomes Project. You must either call variants from the original BAMs jointly with your own samples, or (better) use the reference model workflow to generate GVCFs from the original BAMs, and perform joint genotyping on those GVCFs along with your own samples' GVCFs with GenotypeGVCFs.

  • You can also try using the VQSR with the smaller variant callset, but experiment with argument settings (try adding --maxGaussians 4 to your command line, for example). You should only do this if you are working with a non-model organism for which there are no available genomes or exomes that you can use to supplement your own cohort.


Argument recommendations for VariantRecalibrator

The variant quality score recalibrator builds an adaptive error model using known variant sites and then applies this model to estimate the probability that each variant is a true genetic variant or a machine artifact. One major improvement over previously recommended protocols is that hand filters no longer need to be applied at any point in the process. All filtering criteria are learned from the data itself.

Common, base command line

This is the first part of the VariantRecalibrator command line, to which you need to add either the SNP-specific recommendations or the indel-specific recommendations given further below.

java -Xmx4g -jar GenomeAnalysisTK.jar \
   -T VariantRecalibrator \
   -R path/to/reference/human_g1k_v37.fasta \
   -input raw.input.vcf \
   -recalFile path/to/output.recal \
   -tranchesFile path/to/output.tranches \
   -nt 4 \
   [SPECIFY TRUTH AND TRAINING SETS] \
   [SPECIFY WHICH ANNOTATIONS TO USE IN MODELING] \
   [SPECIFY WHICH CLASS OF VARIATION TO MODEL] \

SNP specific recommendations

For SNPs we use both HapMap v3.3 and the Omni chip array from the 1000 Genomes Project as training data. In addition we take the highest confidence SNPs from the project's callset. These datasets are available in the GATK resource bundle.

   -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.vcf \
   -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.b37.vcf \
   -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.b37.vcf \
   -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.b37.vcf \
   -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an DP -an InbreedingCoeff \
   -mode SNP \

Please note that these recommendations are formulated for whole-genome datasets. For exomes, we do not recommend using DP for variant recalibration (see below for details of why).

Note also that, for the above to work, the input VCF needs to be annotated with the corresponding values (QD, FS, DP, etc.). If any of these values are missing, VariantAnnotator needs to be run first so that VariantRecalibrator can run properly.

Also, using the provided sites-only truth data files is important here as parsing the genotypes for VCF files with many samples increases the runtime of the tool significantly.

You may notice that these recommendations no longer include the --numBadVariants argument. That is because we have removed this argument from the tool, as the VariantRecalibrator now determines the number of variants to use for modeling "bad" variants internally based on the data.
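
Assembled, the full VariantRecalibrator command for SNPs looks like this (a sketch that simply combines the base command above with the SNP-specific arguments; paths are placeholders and file names follow the b37 bundle):

java -Xmx4g -jar GenomeAnalysisTK.jar \
   -T VariantRecalibrator \
   -R path/to/reference/human_g1k_v37.fasta \
   -input raw.input.vcf \
   -recalFile path/to/output.recal \
   -tranchesFile path/to/output.tranches \
   -nt 4 \
   -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.vcf \
   -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.b37.vcf \
   -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.b37.vcf \
   -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.b37.vcf \
   -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an DP -an InbreedingCoeff \
   -mode SNP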

Indel specific recommendations

When modeling indels with the VQSR we use a training dataset that was created at the Broad by strictly curating the (Mills, Devine, Genome Research, 2011) dataset as well as adding in very high confidence indels from the 1000 Genomes Project. This dataset is available in the GATK resource bundle.

   --maxGaussians 4 \
   -resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.b37.vcf  \
   -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.b37.vcf \
   -an QD -an DP -an FS -an SOR -an ReadPosRankSum -an MQRankSum -an InbreedingCoeff \
   -mode INDEL \

Note that indels use a different set of annotations than SNPs. Most annotations related to mapping quality have been removed since there is a conflation with the length of an indel in a read and the degradation in mapping quality that is assigned to the read by the aligner. This covariation is not necessarily indicative of being an error in the same way that it is for SNPs.

As noted above for SNPs, these recommendations no longer include the --numBadVariants argument, which has been removed from the tool; VariantRecalibrator now determines internally, based on the data, how many variants to use for modeling "bad" variants.


Argument recommendations for ApplyRecalibration

The power of the VQSR is that it assigns a calibrated probability to every putative mutation in the callset. The user is then able to decide at what point on the theoretical ROC curve their project wants to live. Some projects, for example, are interested in finding every possible mutation and can tolerate a higher false positive rate. On the other hand, some projects want to generate a ranked list of mutations that they are very certain are real and well supported by the underlying data. The VQSR provides the necessary statistical machinery to effectively apply this sensitivity/specificity tradeoff.

Common, base command line

This is the first part of the ApplyRecalibration command line, to which you need to add either the SNP-specific recommendations or the indel-specific recommendations given further below.

 
 java -Xmx3g -jar GenomeAnalysisTK.jar \
   -T ApplyRecalibration \
   -R reference/human_g1k_v37.fasta \
   -input raw.input.vcf \
   -tranchesFile path/to/input.tranches \
   -recalFile path/to/input.recal \
   -o path/to/output.recalibrated.filtered.vcf \
   [SPECIFY THE DESIRED LEVEL OF SENSITIVITY TO TRUTH SITES] \
   [SPECIFY WHICH CLASS OF VARIATION WAS MODELED] \
 

SNP specific recommendations

For SNPs we used HapMap 3.3 and the Omni 2.5M chip as our truth set. We typically seek to achieve 99.5% sensitivity to the accessible truth sites, but this is by no means universally applicable: you will need to experiment to find out what tranche cutoff is right for your data. Generally speaking, projects involving a higher degree of diversity in terms of world populations can expect to achieve a higher truth sensitivity than projects with a smaller scope.

   --ts_filter_level 99.5 \
   -mode SNP \
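
Assembled, the full ApplyRecalibration command for SNPs looks like this (a sketch combining the base command above with the SNP-specific arguments; paths are placeholders):

java -Xmx3g -jar GenomeAnalysisTK.jar \
   -T ApplyRecalibration \
   -R reference/human_g1k_v37.fasta \
   -input raw.input.vcf \
   -tranchesFile path/to/input.tranches \
   -recalFile path/to/input.recal \
   -o path/to/output.recalibrated.filtered.vcf \
   --ts_filter_level 99.5 \
   -mode SNP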

Indel specific recommendations

For indels we use the Mills / 1000 Genomes indel truth set described above. We typically seek to achieve 99.0% sensitivity to the accessible truth sites, but this is by no means universally applicable: you will need to experiment to find out what tranche cutoff is right for your data. Generally speaking, projects involving a higher degree of diversity in terms of world populations can expect to achieve a higher truth sensitivity than projects with a smaller scope.

   --ts_filter_level 99.0 \
   -mode INDEL \

(How to part II) Sensitively detect copy ratio alterations and allelic segments


Document is currently under review and in BETA. It is incomplete and may contain inaccuracies. Expect changes to the content.


This workflow is broken into two tutorials. You are currently on the second part. See Tutorial#11682 for the first part.

For this second part, the heart of the workflow is segmentation, performed by ModelSegments. In segmentation, contiguous copy ratios are grouped together into segments. The tool performs segmentation for both copy ratios and for allelic copy ratios, given allelic counts. The segmentation is informed by both types of data, i.e. the tool uses allelic data to refine copy ratio segmentation and vice versa. The tutorial refers to this multi-data approach as joint segmentation. The commands presented showcase the full features of the tools. It is possible to perform segmentation for each data type independently, i.e. based solely on copy ratios or solely on allelic counts.

The tutorial illustrates the workflow using a paired sample set. Specifically, detection of allelic copy ratios uses a matched control, i.e. the HCC1143 tumor sample is analyzed using a control, the HCC1143 blood normal. It is possible to run the workflow without a matched control. See section 8.1 for considerations in interpreting allelic copy ratio results for different modes and for different purities.

The GATK4 CNV workflow offers a multitude of levers, e.g. for fine-tuning analyses and for controls. Researchers are expected to tune workflow parameters on samples with copy number profiles similar to that of the case sample under scrutiny. Refer to each tool's documentation for descriptions of parameters.


Jump to a section

  1. Collect raw counts data with PreprocessIntervals and CollectFragmentCounts
    1.1 How do I view HDF5 format data?
  2. Generate a CNV panel of normals with CreateReadCountPanelOfNormals
  3. Standardize and denoise case read counts against the PoN with DenoiseReadCounts
  4. Plot standardized and denoised copy ratios with PlotDenoisedCopyRatios
    4.1 Compare two PoNs: considerations in panel of normals creation
    4.2 Compare PoN denoising versus matched-normal denoising
  5. Count ref and alt alleles at common germline variant sites using CollectAllelicCounts
    5.1 What is the difference between CollectAllelicCounts and GetPileupSummaries?
  6. Group contiguous copy ratios into segments with ModelSegments
  7. Call copy-neutral, amplified and deleted segments with CallCopyRatioSegments
  8. Plot modeled segments and allelic copy ratios with PlotModeledSegments
    8.1 Some considerations in interpreting allelic copy ratios
    8.2 Some results of fine-tuning smoothing parameters


5. Count ref and alt alleles at common germline variant sites using CollectAllelicCounts

CollectAllelicCounts tabulates counts of the reference allele and counts of the dominant alternate allele for each site in a given genomic intervals list. The tutorial performs this step for both the case sample, the HCC1143 tumor, and the matched control, the HCC1143 blood normal. This allele-specific coverage collection is just that: raw coverage collection without any statistical inference. In the next section, ModelSegments uses the allele counts to estimate allelic copy ratios, which the tool in turn uses to refine segmentation.

Collect allele counts for the case and the matched-control alignments independently with the same intervals. For the matched-control analysis, the allelic count sites for the case and control must match exactly. Otherwise, ModelSegments, which takes the counts in the next step, will error. Here we use an intervals list that subsets gnomAD biallelic germline SNP sites to those within the padded, preprocessed exome target intervals [9].

The tutorial has already collected allele counts for full length sample BAMs. To demonstrate coverage collection, the following command uses the small BAMs originally made for Tutorial#11136 [6]. The tutorial does not use the resulting files in subsequent steps.

Collect counts at germline variant sites for the matched-control

gatk --java-options "-Xmx3g" CollectAllelicCounts \
    -L chr17_theta_snps.interval_list \
    -I normal.bam \
    -R /gatk/ref/Homo_sapiens_assembly38.fasta \
    -O sandbox/hcc1143_N_clean.allelicCounts.tsv

Collect counts at the same sites for the case sample

gatk --java-options "-Xmx3g" CollectAllelicCounts \
    -L chr17_theta_snps.interval_list \
    -I tumor.bam \
    -R /gatk/ref/Homo_sapiens_assembly38.fasta \
    -O sandbox/hcc1143_T_clean.allelicCounts.tsv

This results in counts table files. Each data file has header lines that start with an @ (at sign), e.g. @HD, @SQ and @RG lines, followed by a table of data with six columns. An example snippet is shown.
T_allelicCounts_header.png
T_allelicCounts_snippet.png

Comments on select parameters

  • The tool requires one or more genomic intervals specified with -L. The intervals can be either a Picard-style intervals list or a VCF. See Article#1109 for descriptions of formats. The sites should be common and/or sample-specific germline SNP-only sites; omit indel-type and mixed-variant-type sites.
  • The tool requires the reference genome, specified with -R, and aligned reads, specified with -I.
  • As is the case for most GATK tools, the engine filters reads upfront using a number of read filters. Of note for CollectAllelicCounts is the MappingQualityReadFilter. By default, the tool sets the filter's --minimum-mapping-quality to twenty. As a result, the tool will include reads with MAPQ20 and above in the analysis [10].

☞ 5.1 What is the difference between CollectAllelicCounts and GetPileupSummaries?

Another GATK tool, GetPileupSummaries, similarly counts reference and alternate alleles. The resulting summaries are meant for use with CalculateContamination in estimating cross-sample contamination. GetPileupSummaries limits counts collections to those sites with population allele frequencies set by the parameters --minimum-population-allele-frequency and --maximum-population-allele-frequency. Details are here.

CollectAllelicCounts employs fewer engine-level read filters than GetPileupSummaries. Of note, both tools use the MappingQualityReadFilter. However, each sets a different threshold with the filter. GetPileupSummaries uses a --minimum-mapping-quality threshold of 50. In contrast, CollectAllelicCounts sets the --minimum-mapping-quality parameter to 30. In addition, CollectAllelicCounts filters on base quality. The base quality threshold is set with the --minimum-base-quality parameter, whose default is 20.
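
For comparison, a minimal GetPileupSummaries invocation might look like the following (a sketch; the common-sites VCF name is a placeholder, and the -V resource must carry population allele frequencies):

gatk GetPileupSummaries \
    -I tumor.bam \
    -V common_biallelic_sites.vcf.gz \
    -L common_biallelic_sites.vcf.gz \
    -O tumor.pileups.table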




6. Group contiguous copy ratios into segments with ModelSegments

ModelSegments groups together copy ratios and allelic ratios that it determines are contiguous on the same segment. A Gaussian-kernel binary-segmentation algorithm differentiates ModelSegments from the GATK4.beta tool PerformSegmentation, which it replaces. The older tool used a CBS (circular binary segmentation) algorithm. ModelSegments' kernel algorithm enables efficient segmentation of dense data, e.g. that of whole genome sequences. A discussion of preliminary algorithm performance is here.

The algorithm performs segmentation for both copy ratios and for allelic copy ratios jointly when given both datatypes together. For allelic copy ratios, ModelSegments uses only those sites it determines are heterozygous, either in the control in a paired analysis or in the case in a case-only analysis [11]. In the paired analysis, the tool models allelic copy ratios in the case using sites for which the control is heterozygous. The workflow defines allelic copy ratios in terms of alternate-allele fraction, where total allele fractions for reference allele and alternate allele add to one for each site.

For the following command, be sure to specify an existing --output directory or . for the current directory.

gatk --java-options "-Xmx4g" ModelSegments \
    --denoised-copy-ratios hcc1143_T_clean.denoisedCR.tsv \
    --allelic-counts hcc1143_T_clean.allelicCounts.tsv \
    --normal-allelic-counts hcc1143_N_clean.allelicCounts.tsv \
    --output sandbox \
    --output-prefix hcc1143_T_clean

This produces nine files, each with the basename hcc1143_T_clean, in the current directory and listed below. The param files contain global parameters for copy ratios (cr) and allele fractions (af), and the seg files contain data on the segments. For either type of data, the tool gives data before and after segmentation smoothing. The tool documentation details what each file contains. The last two files, labeled hets, contain the allelic counts for the control's heterozygous sites. Counts are for the matched control (normal) and the case.

  1. hcc1143_T_clean.modelBegin.seg
  2. hcc1143_T_clean.modelFinal.seg
  3. hcc1143_T_clean.cr.seg
  4. hcc1143_T_clean.modelBegin.af.param
  5. hcc1143_T_clean.modelBegin.cr.param
  6. hcc1143_T_clean.modelFinal.af.param
  7. hcc1143_T_clean.modelFinal.cr.param
  8. hcc1143_T_clean.hets.normal.tsv
  9. hcc1143_T_clean.hets.tsv

The tool has numerous adjustable parameters and these are described in the ModelSegments tool documentation. The tutorial uses the default values for all of the parameters. Adjusting parameters can change the resolution and smoothness of the segmentation results.

Comments on select parameters

  • The tool accepts either or both of copy-ratio (--denoised-copy-ratios) and allelic-counts (--allelic-counts) data. The matched-control allelic counts (--normal-allelic-counts) are optional. If given both types of data, then copy ratios and allelic counts together inform segmentation for both copy ratio and allelic segments. If given only one type of data, then segmentation is based solely on that type of data.
  • The --minimum-total-allele-count is set to 30 by default. This means the tool only considers sites with a read depth of 30 or more for allelic copy ratios.
  • The --genotyping-homozygous-log-ratio-threshold option is set to -10.0 by default. Increase this to increase the number of sites assumed to be heterozygous for modeling.
  • Default smoothing parameters are optimized for faster performance, given the size of whole genomes. The --maximum-number-of-smoothing-iterations option caps smoothing iterations to 25. MCMC model sampling is set to 100 for both copy-ratio and allele-fraction sampling, by the --number-of-samples-copy-ratio and --number-of-samples-allele-fraction options, respectively. Finally, --number-of-smoothing-iterations-per-fit is set to zero by default to disable model refitting between iterations. This means the tool will generate only two MCMC fits--an initial and a final fit.

    • GATK4.beta's ACNV set this parameter such that each smoothing iteration refit using MCMC, at the cost of compute. For the tutorial data, which is targeted exomes, the default zero gives 398 segments after two smoothing iterations, while setting --number-of-smoothing-iterations-per-fit to one gives 311 segments after seven smoothing iterations. Section 8 plots these alternative results.
  • For advanced smoothing recommendations, see [12].

Section 8 shows the results of segmentation, the result from changing --number-of-smoothing-iterations-per-fit and the result of allelic segmentation modeled from allelic counts data alone. Section 8.1 details considerations depending on analysis approach and purity of samples. Section 8.2 shows the results of changing the advanced smoothing parameters given in [12].
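
As an illustration of the --number-of-smoothing-iterations-per-fit setting discussed above, the section 6 command can be rerun with one refit per smoothing iteration (a sketch; only the changed parameter and the output prefix, which is a placeholder, differ from the earlier command):

gatk --java-options "-Xmx4g" ModelSegments \
    --denoised-copy-ratios hcc1143_T_clean.denoisedCR.tsv \
    --allelic-counts hcc1143_T_clean.allelicCounts.tsv \
    --normal-allelic-counts hcc1143_N_clean.allelicCounts.tsv \
    --number-of-smoothing-iterations-per-fit 1 \
    --output sandbox \
    --output-prefix hcc1143_T_clean_smooth1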

ModelSegments runs in the following three stages.

  1. Genotypes heterozygous sites and filters on depth and for sites that overlap with copy-ratio intervals.
    • Allelic counts for sites in the control that are heterozygous are written to hets.normal.tsv. For the same sites in the case, allelic counts are written to hets.tsv.
    • If given only allelic counts data, ModelSegments does not apply intervals.
  2. Performs multidimensional kernel segmentation.
    • Uses allelic counts within each copy-ratio interval for each contig.
    • Uses denoised copy ratios and heterozygous allelic counts.
  3. Performs Markov-Chain Monte Carlo (MCMC) sampling and segment smoothing. In particular, the tool uses Gibbs sampling and slice sampling. These MCMC samplings inform smoothing, i.e. merging adjacent segments, and the tool can perform multiple iterations of sampling and smoothing [13].
    • Fits initial model. Writes initial segments to modelBegin.seg, posterior summaries for copy-ratio global parameters to modelBegin.cr.param and allele-fraction global parameters to modelBegin.af.param.
    • Iteratively performs segment smoothing and sampling. Fits allele-fraction model [14] until log likelihood converges. This process produces global parameters.
    • Samples final models. Writes final segments to modelFinal.seg, posterior summaries for copy-ratio global parameters to modelFinal.cr.param, posterior summaries for allele-fraction global parameters to modelFinal.af.param and final copy-ratio segments to cr.seg.

At the second stage, the tutorial data generates the following message.

INFO  MultidimensionalKernelSegmenter - Found 638 segments in 23 chromosomes.

At the third stage, the tutorial data generates the following message.

INFO  MultidimensionalModeller - Final number of segments after smoothing: 398

For the tutorial data, the initial segmentation gives 638 segments over 23 contigs; after smoothing with default parameters, 398 segments remain.




7. Call copy-neutral, amplified and deleted segments with CallCopyRatioSegments

CallCopyRatioSegments allows for systematic calling of copy-neutral, amplified and deleted segments. This step is not required for plotting segmentation results. Provide the tool with the cr.seg segmentation result from ModelSegments.

gatk CallCopyRatioSegments \
    --input hcc1143_T_clean.cr.seg \
    --output sandbox/hcc1143_T_clean.called.seg

The resulting called.seg data adds a sixth column to the provided copy ratio segmentation table. The tool denotes amplifications with a plus sign (+), deletions with a minus sign (-) and copy-neutral segments with a zero (0).

Here is a snippet of the results.
T_called_seg.png

Comments on select parameters
- The parameters --neutral-segment-copy-ratio-lower-bound (default 0.9) and --neutral-segment-copy-ratio-upper-bound (default 1.1) together set the copy ratio range for copy-neutral segments. These two parameters replace the GATK4.beta workflow’s --neutral-segment-copy-ratio-threshold option.
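
For instance, to tighten the copy-neutral range, the bounds can be overridden (a sketch; the values are arbitrary and the file names follow the command above):

gatk CallCopyRatioSegments \
    --input hcc1143_T_clean.cr.seg \
    --neutral-segment-copy-ratio-lower-bound 0.95 \
    --neutral-segment-copy-ratio-upper-bound 1.05 \
    --output sandbox/hcc1143_T_clean.called.seg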




8. Plot modeled copy ratio and allelic fraction segments with PlotModeledSegments

PlotModeledSegments visualizes copy and allelic ratio segmentation results.

gatk PlotModeledSegments \
    --denoised-copy-ratios hcc1143_T_clean.denoisedCR.tsv \
    --allelic-counts hcc1143_T_clean.hets.tsv \
    --segments hcc1143_T_clean.modelFinal.seg \
    --sequence-dictionary Homo_sapiens_assembly38.dict \
    --minimum-contig-length 46709983 \
    --output sandbox/plots \
    --output-prefix hcc1143_T_clean

This produces plots in the plots folder. The plots represent final modeled segments for both copy ratios and alternate allele fractions. If we are curious about the extent of smoothing provided by MCMC, then we can similarly plot initial kernel segmentation results by substituting in --segments hcc1143_T_clean.modelBegin.seg.

Comments on select parameters
- The tutorial provides the --sequence-dictionary that matches the GRCh38 reference used in mapping [4].
- To omit alternate and decoy contigs from the plots, the tutorial adjusts the --minimum-contig-length from the default value of 1,000,000 to 46,709,983, the length of the smallest of GRCh38's primary assembly contigs.

As of this writing, it is NOT possible to subset plotting with genomic intervals, i.e. with the -L parameter. To interactively visualize data, consider the following options.

  • Modify the sequence dictionary to contain only the contigs of interest, in the order desired.
  • Convert the data to bedGraph format for targeted exomes or to bigWig format for whole genomes. An example of CNV data converted to bedGraph and visualized in IGV is given in this discussion.
  • Alternatively, researchers versed in R may choose to visualize subsets of data using RStudio.

Below are three sets of results for the HCC1143 tumor cell line in order of increasing smoothing. The top plot of each set shows the copy ratio segments. The bottom plot of each set shows the allele fraction segments.

  • In the denoised copy ratio segment plot, individual targets still display as points on the plot. Different copy ratio segments are indicated by alternating blue and orange color groups. The denoised median is drawn in thick black.
  • In the allele fraction plot, the boxes surrounding the alternate allelic fractions do NOT indicate standard deviation nor standard error, which biomedical researchers may be more familiar with. Rather, the allelic fraction data is given in credible intervals. The allelic copy ratio plot shows the 10th, 50th and 90th percentiles. These should be interpreted with care as explained in section 8.1. Individual allele fraction data display as faint data points, also in orange and blue.

8A. Initial segmentation before MCMC smoothing gives 638 segments.
T_modelbegin.modeled.png

8B. Default smoothing gives 398 segments.
T_modelfinal.modeled.png

8C. Enabling additional smoothing iterations per fit gives 311 segments. See section 6 for a description of the --number-of-smoothing-iterations-per-fit parameter.
T_increase_smoothing_1.modeled.png

Smoothing accounts for data points that are outliers. Some of these outliers could be artifactual and therefore not of interest, while others could be true copy number variation that would then be missed. To understand the impact of joint copy ratio and allelic counts segmentation, compare the results of 8B to the single-data segmentation results below. Each plot below shows the results of modeling segmentation on a single type of data, either copy-ratios or allelic counts, using default smoothing parameters.

8D. Copy ratio segmentation based on copy ratios alone gives 235 segments.
T_caseonly.modeled.png

8E. Allelic segmentation result based on allelic counts alone in the matched case gives 105 segments.
T-matched-normal_just_allelic.modeled.png

Compare chr1 and chr2 segmentation for the various plots. In particular, pay attention to the p arm (left side) of chr1 and q arm (right side) of chr2. What do you think is happening when adjacent segments are slightly shifted from each other in some sets but then seemingly at the same copy ratio for other sets?

For allelic counts, ModelSegments retains 16,872 sites that are heterozygous in the control. Of these, the case presents 15,486 usable sites. In allelic segmentation using allelic counts alone, the tool uses all of the usable sites. In the matched-control scenario, ModelSegments emits the following message.

INFO  MultidimensionalKernelSegmenter - Using first allelic-count site in each copy-ratio interval (12668 / 15486) for multidimensional segmentation...

The message informs us that for the matched-control scenario, ModelSegments uses the first allele-count site for each genomic interval towards allelic modeling. For the tutorial data, this is 12,668 out of the 15,486 or 81.8% of the usable allele-count sites. The exclusion of ~20% of allelic-counts sites, together with the lack of copy ratio data informing segmentation, account for the difference we observe in this and the previous allelic segmentation plot.

In the allele fraction plot, some of the alternate-allele fractions are around 0.35/0.65 and some are at 0/1. We also see alternate-allele fractions around 0.25/0.75 and 0.5. These suggest ploidies that are multiples of one, two, three and four.

Is it possible a copy ratio of one is not diploid but represents some other ploidy?

For the plots above, focus on chr4, chr5 and chr17. Based on both the copy ratio and allelic results, what is the zygosity of each of the chromosomes? What proportion of each chromosome could be described as having undergone copy-neutral loss of heterozygosity?


☞ 8.1 Some considerations in interpreting allelic copy ratios

For allelic copy ratio analysis, the matched-control is a sample from the same individual as the case sample. In the somatic case, the matched-control is the germline normal sample and the case is the tumor sample from the same individual.

The matched-control case presents the following considerations.

  • If a matched control contains any region with copy number amplification, the skewed allele fractions still allow correct interpretation of the original heterozygosity.
  • However, if a matched control contains deleted regions or regions with copy-neutral loss of heterozygosity or a long stretch of homozygosity, e.g. as occurs in uniparental disomy, then these regions would go dark so to speak in that they become apparently homozygous and so ModelSegments drops them from consideration.
  • From population sequencing projects, we know the expected heterozygosity of normal germline samples averages around one in a thousand. However, the GATK4 CNV workflow does not account for any heterozygosity expectations. An example of such an analysis that utilizes SNP array data is HAPSEG. It is available on GenePattern.
  • If a matched normal contains tumor contamination, this should still allow for the normal to serve as a control. The expectation is that somatic mutations coinciding with common germline SNP sites will be rare and ModelSegments (i) only counts the dominant alt allele at multiallelic sites and (ii) recognizes and handles outliers. To estimate tumor in normal (TiN) contamination, see the Broad CGA group's deTiN.

Here are some considerations for detecting loss of heterozygosity regions.

  • In the matched-control case, if the case sample is pure, i.e. not contaminated with the control sample, then we see loss of heterozygosity (LOH) segments near alternate-allele fractions of zero and one.
  • If the case is contaminated with matched control, whether the analysis is matched or not, then the range of alternate-allele fractions becomes squished so to speak in that the contaminating normal's heterozygous sites add to the allele fractions. In this case, putative LOH segments still appear at the top and bottom edges of the allelic plot, at the lowest and highest alternate-allele fractions. For a given depth of coverage, the fraction of reads that differentiates zygosity is narrower in range and therefore harder to differentiate visually.

    8F. Case-only analysis of tumor contaminated with normal still allows for LOH detection. Here, we bluntly added together the tutorial tumor and normal sample reads. Results for the matched-control analysis are similar.
    mixTN_tumoronly.modeled.png

  • In the tumor-only case, if the tumor is pure, because ModelSegments drops homozygous sites from consideration and only models sites it determines are heterozygous, the workflow cannot ascertain LOH segments. Such LOH regions may present as an absence of allelic data or as low confidence segments, i.e. having a wide confidence interval on the allelic plot. Compare such a result below to that of the matched case in 8E above.

    8G. Allelic segmentation result based on allelic counts alone for case-only, when the case is pure, can produce regions of missing representation and low confidence allelic fraction segments.
    T-only_just_allelic.modeled.png

    Compare results. Focus on chr4, chr5 and chr17. While the matched-case gives homozygous zygosity for each of these chromosomes, the case-only allelic segmentation either presents an absence of segments for regions or gives low confidence allelic fraction segments at alternate allele fractions that are inaccurate, i.e. do not represent actual zygosity. This is particularly true for tumor samples where aneuploidy and LOH are common. Interpret case-only allelic results with caution.

Finally, remember the tutorial analyses above utilize allelic counts from gnomAD sites of common population variation that have been lifted-over from GRCh37 to GRCh38. For allelic count sites, use of sample-specific germline variant sites may incrementally increase resolution. Also, use of confident variant sites from a callset derived from alignments to the target reference may help decrease noise. Confident germline variant sites can be derived with HaplotypeCaller calling on alignments and subsequent variant filtration. Alternatively, it is possible to fine-tune ModelSegments smoothing parameters to dampen noise.


☞ 8.2 Some results of fine-tuning smoothing parameters

This section shows plotting results of changing some advanced smoothing parameters. The parameters and their defaults are given below, in the order of recommended consideration [12].

--number-of-changepoints-penalty-factor 1.0 \
--kernel-variance-allele-fraction 0.025 \
--kernel-variance-copy-ratio 0.0 \
--kernel-scaling-allele-fraction 1.0 \
--smoothing-credible-interval-threshold-allele-fraction 2.0 \
--smoothing-credible-interval-threshold-copy-ratio 2.0 \

The first four parameters impact segmentation while the last two impact modeling. The following plots show the results of changing these smoothing parameters. The tutorial chose argument values arbitrarily, for illustration purposes. Results should be compared to those of 8B, which gives 398 segments.

8H. Increasing changepoints penalty factor from 1.0 to 5.0 gives 140 segments.


8I. Increasing kernel variance parameters each to 0.8 gives 144 segments. Changing --kernel-variance-copy-ratio alone to 0.025 increases the number of segments greatly, to 1,266 segments. Changing it to 0.2 gives 414 segments.


8J. Decreasing kernel scaling from 1.0 to 0 gives 236 segments. Conversely, increasing kernel scaling from 1.0 to 5.0 gives 551 segments.


8K. Increasing both smoothing parameters each from 2.0 to 10.0 gives 263 segments.




Footnotes


[9] The GATK Resource Bundle provides two variations of a SNPs-only gnomAD project resource VCF. Both VCFs are sites-only eight-column VCFs but one retains the AC allele count and AF allele frequency variant-allele-specific annotations, while the other removes these to reduce file size.

  • For targeted exomes, it may be convenient to subset these to the preprocessed intervals, e.g. with SelectVariants for use with CollectAllelicCounts. This is not necessary, however, as ModelSegments drops sites outside the target regions from its analysis in the joint-analysis approach.
  • For whole genomes, depending on the desired resolution of the analysis, consider subsetting the gnomAD sites to those commonly variant, e.g. above an allele frequency threshold (see the sketch after this list). Note that SelectVariants, as of this writing, can filter on AF allele frequency only for biallelic sites. Non-biallelic sites make up ~3% of the gnomAD SNPs-only resource.
  • For more resolution, consider adding sample-specific germline variant biallelic SNPs-only sites to the intervals. Section 8.1 shows allelic segmentation results for such an analysis.
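
A minimal SelectVariants sketch for such subsetting (file names are placeholders, the 0.05 allele-frequency cutoff is arbitrary, and the -select expression requires the resource to carry the AF annotation):

gatk SelectVariants \
    -V af-only-gnomad.hg38.vcf.gz \
    --restrict-alleles-to BIALLELIC \
    --select-type-to-include SNP \
    -select "AF > 0.05" \
    -O gnomad_common_biallelic_snps.vcf.gz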


[10] The MAPQ20 threshold of CollectAllelicCounts is lower than that used by CollectFragmentCounts, which uses MAPQ30.


[11] In particular, the tool considers only heterozygous sites that have counts for both the reference allele and the alternate allele. If multiple alternate alleles are present, the tool uses the alternate allele with the highest count and ignores any other alternate allele(s).


[12] These advanced smoothing recommendations are from one of the workflow developers--@slee.

  • For smoother results, first increase --number-of-changepoints-penalty-factor from its default of 1.0 (see the sketch after this list).
  • If the above does not suffice, then consider changing the kernel-variance parameters --kernel-variance-copy-ratio (default 0.0) and --kernel-variance-allele-fraction (default 0.025), or change the weighting of the allele-fraction data by changing --kernel-scaling-allele-fraction (default 1.0).
  • If such changes are still insufficient, then consider adjusting the smoothing-credible-interval-threshold parameters --smoothing-credible-interval-threshold-copy-ratio (default 2.0) and --smoothing-credible-interval-threshold-allele-fraction (default 2.0). Increasing these will more aggressively merge adjacent segments.
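
Translated into a command, the first recommendation might look like the following (a sketch reusing the section 6 inputs; the value 5.0 mirrors result 8H, and the output prefix is a placeholder):

gatk --java-options "-Xmx4g" ModelSegments \
    --denoised-copy-ratios hcc1143_T_clean.denoisedCR.tsv \
    --allelic-counts hcc1143_T_clean.allelicCounts.tsv \
    --normal-allelic-counts hcc1143_N_clean.allelicCounts.tsv \
    --number-of-changepoints-penalty-factor 5.0 \
    --output sandbox \
    --output-prefix hcc1143_T_clean_penalty5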


[13] In particular, the tool uses Gibbs sampling, a type of MCMC sampling, for both allele-fraction modeling and copy-ratio modeling, and additionally uses slice sampling for allele-fraction modeling. @slee details the following substeps.

  1. Perform MCMC (Gibbs) to fit the copy-ratio model posteriors.
  2. Use optimization (of the log likelihood) to initialize the Markov Chain for the allele-fraction model.
  3. Perform MCMC (Gibbs and slice) to fit the allele-fraction model posteriors.
  4. The initial model is now fit. We write the corresponding modelBegin files, including those for global parameters.
  5. Iteratively perform segment smoothing.
  6. Perform steps 1-4 again, this time to generate the final model fit and modelFinal files.


[14] @slee shares that the tool initializes the MCMC by starting off at the maximum a posteriori (MAP) point in parameter space.

Many thanks to Samuel Lee (@slee), no relation, for his patient explanations of the workflow parameters and mathematics behind the workflow. Researchers may find reading through @slee's comments on the forum, a few of which the tutorials link to, insightful.



ERROR: GTs cannot be missing for some samples if they are available for others in the record


Hi:

I'm trying to work with exome chip data within GATK, so I've converted my ped/map files to VCF format using PLINK/SEQ. When I run SelectVariants in an attempt to pull out a few individual IDs, I get the following error:

"GTs cannot be missing for some samples if they are available for others in the record"

I'm assuming this refers to how some individuals may have a "./." genotype vs. 0/0, 0/1 or 1/1. Is there a way to work around this?

What's in the resource bundle and how can I get it?


NOTE: we recently made some changes to the bundle on the FTP server; see the Resource Bundle page for details. In a nutshell: minor directory structure changes, and Hg38 bundle now mirrors the cloud version.


1. Accessing the bundle

See the Resource Bundle page. In a nutshell, there's a Google Cloud bucket and an FTP server. The cloud bucket only has Hg38 resources; the resources for other builds are currently only available through the FTP server. Let us know if you want them on the Cloud too.
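
For illustration, FTP access has historically looked roughly like the following (a sketch only; the host, credentials and paths may have changed, so defer to the Resource Bundle page for current details):

wget ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/b37/dbsnp_138.b37.vcf.gz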


2. GRCh38/Hg38 Resources: the soon-to-be Standard Set

This contains all the resource files needed for Best Practices short variant discovery in whole-genome sequencing data (WGS). Exome files and itemized resource list coming soon(ish).


All resources below this are available only on the FTP server, not on the cloud.


3. b37 Resources: the Standard Data Set pending completion of the Hg38 bundle

  • Reference sequence (standard 1000 Genomes fasta) along with fai and dict files
  • dbSNP in VCF. This includes two files:

    • A recent dbSNP release (build 138)
    • This file subsetted to only sites discovered in or before dbSNPBuildID 129, which excludes the impact of the 1000 Genomes project and is useful for evaluation of dbSNP rate and Ti/Tv values at novel sites.
  • HapMap genotypes and sites VCFs

  • OMNI 2.5 genotypes for 1000 Genomes samples, along with a sites VCF
  • The current best set of known indels to be used for local realignment (note that we don't use dbSNP for this anymore); use both files:

    • 1000G_phase1.indels.b37.vcf (currently from the 1000 Genomes Phase I indel calls)
    • Mills_and_1000G_gold_standard.indels.b37.sites.vcf
  • The latest set from 1000G phase 3 (v4) for genotype refinement: 1000G_phase3_v4_20130502.sites.vcf

  • A large-scale standard single sample BAM file for testing:

    • NA12878.HiSeq.WGS.bwa.cleaned.recal.b37.20.bam containing ~64x reads of NA12878 on chromosome 20
    • A callset produced by running UnifiedGenotyper on the dataset above. Note that this resource is out of date and does not represent the results of our Best Practices. This will be updated in the near future.
  • The Broad's custom exome targets list: Broad.human.exome.b37.interval_list (note that you should always use the exome targets list that is appropriate for your data, which typically depends on the prep kit that was used, and should be available from the kit manufacturer's website)

Additionally, these files all have supplementary indices, statistics, and other QC data available.


4. hg19 Resources: lifted over from b37

Includes the UCSC-style hg19 reference along with all lifted over VCF files.


5. hg18 Resources: lifted over from b37

Includes the UCSC-style hg18 reference along with all lifted over VCF files. The refGene track and BAM files are not available. We only provide data files for this genome-build that can be lifted over "easily" from our master b37 repository. Sorry for whatever inconvenience that this might cause.

Also includes a chain file to lift over to b37.


6. b36 Resources: lifted over from b37

Includes the 1000 Genomes pilot b36 formatted reference sequence (human_b36_both.fasta) along with all lifted over VCF files. The refGene track and BAM files are not available. We only provide data files for this genome-build that can be lifted over "easily" from our master b37 repository. Sorry for whatever inconvenience that this might cause.

Also includes a chain file to lift over to b37.


Why are INFO fields moved around by GATK4 - SelectVariants call?


Hi there,

I'm trying to subset a multi-samples VCF file by selected sample IDs. I'm using GATK4 - SelectVariants.

/g/data3/a32/Software/GATK/gatk-4.0.4.0/gatk --java-options "-Xmx8g -Djava.io.tmpdir=$working_FS" SelectVariants \
 -R "/g/data3/a32/References_and_Databases/hg38.noalt.decoy.bwa/GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fa" \
 -V $input_VCFfile \
 -O $output_VCFfile \
 -sn $tmp_file2 \
 --exclude-non-variants

The task completes and I can see that it has subset for the samples I want.

However, I noticed that the INFO fields from the original full VCF are moved around in the VCF created by SelectVariants.

Is there a way to retain the same position of the INFO fields in the subset file?

Hope you can help. Thanks
Eddie Ip



Variant Calling


I have two whole exome sequence data files. I have aligned and sorted them using BWA and Picard. I want to create a single VCF file from these two samples.

As I am a newbie to computational science, is there a step-by-step GATK tutorial that could take me through this process?
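
For reference, a minimal sketch of the GATK4 joint-calling route for two samples (file and sample names are placeholders; this assumes analysis-ready BAMs, i.e. deduplicated and recalibrated):

gatk HaplotypeCaller -R ref.fasta -I sample1.bam -O sample1.g.vcf.gz -ERC GVCF
gatk HaplotypeCaller -R ref.fasta -I sample2.bam -O sample2.g.vcf.gz -ERC GVCF
gatk CombineGVCFs -R ref.fasta -V sample1.g.vcf.gz -V sample2.g.vcf.gz -O combined.g.vcf.gz
gatk GenotypeGVCFs -R ref.fasta -V combined.g.vcf.gz -O joint.vcf.gz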

HaplotypeCaller filters out all reads (trying to use GATK4 for RNA-seq data)


Dear all,
I'm trying to update our pipeline for identifying SNPs and mutations in RNA sequencing samples. I spent a lot of time online trying to figure out how to adjust the previous commands to the newest releases of the various tools, but the final steps filter out all of the reads, producing a VCF file that doesn't include anything but headers.
Attached is the general workflow, trying to analyze ERR361240 from SRA:

STAR --runThreadN 32 --genomeDir /mnt/lustre/hms-01/fs01/galaxy1/reference_data/Homo_sapiens/Ensembl/GRCh38/Sequence/STARIndex --readFilesIn file.fastq --outSAMtype BAM Unsorted --outSAMmapqUnique 60 --outFilterMultimapNmax 20 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --outFilterMismatchNmax 999 --outFilterMismatchNoverLmax 0.1 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000 --outFilterScoreMinOverLread 0.33 --outFilterMatchNminOverLread 0.33 --limitSjdbInsertNsj 1200000 \

note that I already changed the output MAPQ to 60 rather than 255

picard.jar AddOrReplaceReadGroups I=Aligned.out.bam O=Aligned.out_rg.bam RGID=null RGLB=lb RGPL=illumina RGPU=pu RGSM=ES

picard.jar ReorderSam I=Aligned.out_rg.bam O=Aligned.out_rg_sorted.bam R=/mnt/lustre/hms-01/fs01/yishaia/data/human/galaxy/hg38.fa

picard.jar SortSam I=Aligned.out_rg_sorted.bam O=Aligned.out_rg_sorted2.bam SORT_ORDER=coordinate CREATE_INDEX=true

gatk SplitNCigarReads -R /mnt/lustre/hms-01/fs01/yishaia/data/human/galaxy/hg38.fa -I Aligned.out_rg_sorted2.bam -O split.bam

gatk --java-options -Xmx4g HaplotypeCaller -R /mnt/lustre/hms-01/fs01/yishaia/data/human/galaxy/hg38.fa -I split.bam --dont-use-soft-clipped-bases --stand-call-conf 20.0 -O variants_output.vcf

It seems that the SplitNCigarReads worked, as the log notes:
14:32:31.381 INFO SplitNCigarReads - No reads filtered by: AllowAllReadsReadFilter
14:32:31.381 INFO ProgressMeter - KI270752.1:25118 140.1 121004130 863481.5
14:32:31.381 INFO ProgressMeter - Traversal complete. Processed 121004130 total reads in 140.1 minutes.
INFO 2018-07-11 14:32:33 SortingCollection Creating merging iterator from 137 files
14:35:23.427 INFO SplitNCigarReads - Shutting down engine
[July 11, 2018 2:35:23 PM IDT] org.broadinstitute.hellbender.tools.walkers.rnaseq.SplitNCigarReads done. Elapsed time: 143.02 minutes.
Runtime.totalMemory()=3614441472

However, the next step logs the following:

14:49:16.138 INFO HaplotypeCaller - 68471979 read(s) filtered by: ((((((((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappedReadFilter) AND NotSecondaryAlignmentReadFilter) AND NotDuplicateReadFilter) AND PassesVendorQualityCheckReadFilter) AND NonZeroReferenceLengthAlignmentReadFilter) AND GoodCigarReadFilter) AND WellformedReadFilter)
18480956 read(s) filtered by: (((((((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappedReadFilter) AND NotSecondaryAlignmentReadFilter) AND NotDuplicateReadFilter) AND PassesVendorQualityCheckReadFilter) AND NonZeroReferenceLengthAlignmentReadFilter) AND GoodCigarReadFilter)
18480956 read(s) filtered by: ((((((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappedReadFilter) AND NotSecondaryAlignmentReadFilter) AND NotDuplicateReadFilter) AND PassesVendorQualityCheckReadFilter) AND NonZeroReferenceLengthAlignmentReadFilter)
18480956 read(s) filtered by: (((((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappedReadFilter) AND NotSecondaryAlignmentReadFilter) AND NotDuplicateReadFilter) AND PassesVendorQualityCheckReadFilter)
18480956 read(s) filtered by: ((((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappedReadFilter) AND NotSecondaryAlignmentReadFilter) AND NotDuplicateReadFilter)
18480956 read(s) filtered by: (((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappedReadFilter) AND NotSecondaryAlignmentReadFilter)
18480956 read(s) filtered by: ((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappedReadFilter)
18480956 read(s) filtered by: (MappingQualityReadFilter AND MappingQualityAvailableReadFilter)
18480956 read(s) filtered by: MappingQualityReadFilter
49991023 read(s) filtered by: WellformedReadFilter

14:49:16.138 INFO ProgressMeter - KI270394.1:901 13.7 10332600 751583.9
14:49:16.138 INFO ProgressMeter - Traversal complete. Processed 10332600 total regions in 13.7 minutes.
14:49:16.146 INFO VectorLoglessPairHMM - Time spent in setup for JNI call : 0.0
14:49:16.146 INFO PairHMM - Total compute time in PairHMM computeLogLikelihoods() : 0.0
14:49:16.146 INFO SmithWatermanAligner - Total compute time in java Smith-Waterman : 0.00 sec
14:49:16.146 INFO HaplotypeCaller - Shutting down engine
[July 11, 2018 2:49:16 PM IDT] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 13.77 minutes.
Runtime.totalMemory()=1112539136


I honestly don't know where the problem is. This sample worked well with an older version of GATK, using TopHat to align it. Any help would be highly appreciated.

Thanks,
Yishai
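
Two quick diagnostic checks that may narrow this down, assuming samtools and Picard are on the path and that split.bam is the SplitNCigarReads output. The MappingQualityReadFilter and WellformedReadFilter counts above point at either problematic MAPQ values or malformed records:

# distribution of MAPQ values HaplotypeCaller actually sees
samtools view split.bam | awk '{print $5}' | sort -n | uniq -c

# summary of record-level problems WellformedReadFilter may be reacting to
java -jar picard.jar ValidateSamFile I=split.bam MODE=SUMMARY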


ClusterCrosscheckMetrics results missing distance metric


I'm using ClusterCrosscheckMetrics to compute clusters between samples; it nicely outputs the clusters and the number of samples per cluster, but I am missing the distance metrics that could be used for plotting dendrograms. Could this feature be added to the program?

X chromosome haplotype caller


Hi!!
I am trying to call variants on the X chromosome; I've read all the related threads and didn't find a solution.
I have 42 males (bovine samples) and I want to call just the X chromosome, so I've used only the X chromosome sequence.
My command is:
./gatk HaplotypeCaller -R bosTau8.fa -I bov_1_calmd.realigned_bam_file.bam -I bov_2_calmd.realigned_bam_file.bam .......................-I bov_42_calmd.realigned_bam_file.bam -ploidy 1 -L chrX -min-pruning 1 -min-dangling-branch-length 1 -mbq 1 -stand-call-conf 2 -O X_prueba.vcf

And I just get a lot of this:
16:14:41.212 WARN DepthPerSampleHC - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
and:
16:14:41.910 WARN InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples

Please help!
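
For what it's worth, the DepthPerSampleHC warning is normally emitted only for sites where no genotype was called, and the InbreedingCoeff annotation needs at least 10 called diploid founder genotypes, which haploid calls may never satisfy. If the warnings are just noise, one option is to drop the annotation; a sketch, assuming GATK4's --annotations-to-exclude argument and keeping the elided -I list as in the original command:

./gatk HaplotypeCaller -R bosTau8.fa -I bov_1_calmd.realigned_bam_file.bam ... -I bov_42_calmd.realigned_bam_file.bam -ploidy 1 -L chrX -min-pruning 1 -min-dangling-branch-length 1 -mbq 1 -stand-call-conf 2 --annotations-to-exclude InbreedingCoeff -O X_prueba.vcf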

can VariantsToTable output the raw genotype call (i.e., 0/1) rather than the actual basecall (A/T)?


I'm interested in getting simple "heterozygous" or "homozygous" designations for all of the samples/SNPs in my multisample VCF file. In the past, I have been using the -GF GT option in VariantsToTable and then annotating my basecalls in Excel as either heterozygous or homozygous. This takes forever, since Excel isn't really built for data this big. Is there a simple way to output all of the genotypes as 0/0, 0/1, or 1/1 instead of C/A, A/A, G/T, C/C?
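
One workaround outside GATK is to query the genotype field directly, since bcftools' FORMAT expansion %GT prints the raw 0/0-style indices. A sketch, assuming bcftools is installed and multisample.vcf is a hypothetical file name:

bcftools query -f '%CHROM\t%POS\t%REF\t%ALT[\t%GT]\n' multisample.vcf > genotypes.tsv

Heterozygous and homozygous designations can then be assigned with awk or a short script, without going through Excel.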


Run GATK for multiple populations


Hi, I have sequencing data for two sister populations that I wish to call genotypes for. While I expect variants to be shared between the populations, the allele frequencies may differ. Is there a way I can specify that these samples come from two source populations for GATK's joint calling?

Thanks!
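
As far as I know, GenotypeGVCFs takes no population labels, so a common pattern is to joint-genotype each population separately (or genotype everything together and stratify allele frequencies downstream). A sketch, assuming the GVCF workflow and hypothetical per-population file names:

gatk CombineGVCFs -R ref.fasta -V popA_sample1.g.vcf.gz -V popA_sample2.g.vcf.gz -O popA.g.vcf.gz
gatk GenotypeGVCFs -R ref.fasta -V popA.g.vcf.gz -O popA.vcf.gz
# repeat for population B, then compare per-population allele frequencies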

How to force MuTect2 genotype all sites within intervals?


I cannot get MuTect2 to generate genotypes for all input sites (-L input.vcf). I tried --output_mode EMIT_ALL_SITES and -gt_mode GENOTYPE_GIVEN_ALLELES without luck. Is there an option equivalent to MuTect's --force_output? I know it is more complicated with indels, where generating all possible indels is not applicable, but I am only interested in the specific genotypes I supplied with -L.
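
One combination that may be worth trying, sketched here on the assumption that GATK3-style MuTect2 inherits HaplotypeCaller's given-alleles arguments (treat the flags as unverified for your exact version): pass the VCF to --alleles in addition to -L.

java -jar GenomeAnalysisTK.jar -T MuTect2 -R ref.fasta -I:tumor tumor.bam -I:normal normal.bam -gt_mode GENOTYPE_GIVEN_ALLELES --alleles input.vcf -L input.vcf -o forced.vcf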

Problem encountered when running MarkDuplicatesSpark using 4.0.6


Hi Board folks,

I am trying to run the MarkDuplicatesSpark on my spark cluster and keep getting following exceptions:

Caused by: java.lang.NullPointerException
at org.broadinstitute.hellbender.utils.read.markduplicates.sparkrecords.Fragment.<init>(Fragment.java:37)
at org.broadinstitute.hellbender.utils.read.markduplicates.sparkrecords.MarkDuplicatesSparkRecord.newFragment(MarkDuplicatesSparkRecord.java:32)
at org.broadinstitute.hellbender.tools.spark.transforms.markduplicates.MarkDuplicatesSparkUtils.lambda$null$0(MarkDuplicatesSparkUtils.java:115)
at java.util.stream.ReferencePipeline$11$1.accept(ReferencePipeline.java:372)
at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
at org.broadinstitute.hellbender.tools.spark.transforms.markduplicates.MarkDuplicatesSparkUtils.lambda$transformToDuplicateNames$17d832cf$1(MarkDuplicatesSparkUtils.java:123)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$3$1.apply(JavaRDDLike.scala:143)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$3$1.apply(JavaRDDLike.scala:143)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:199)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

I have tried both on my remote cluster
--spark-master spark://{my_cluster_ip}:7077 --spark-runner SPARK
and local spark
--spark-master local[16]

They both gave me the same error.

The version I am using is the latest, 4.0.6.
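
A possible workaround until the Spark NullPointerException is resolved, assuming the non-Spark Picard implementation bundled with GATK4 is acceptable for your data volume:

gatk MarkDuplicates -I input.bam -O marked.bam -M duplicate_metrics.txt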

GATK4 missed some true positive variants


Hi,

I found that GATK4 missed some true-positive variants that are called by other variant callers (using the TCGA DREAM data).

For example:
chr10 132716926 C T

chr1 212016625 G C
chr2 100472297 G T
chr6 134763510 C A
chr6 166581670 G T
chr15 23655107 A C
chr18 44530538 T G

Thanks!
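
One way to see why an assembly-based caller drops such sites, sketched here assuming GATK4 Mutect2 and hypothetical file and sample names, is to re-run over a small window around a missed position and inspect the reassembled reads via the bamout:

gatk Mutect2 -R ref.fasta -I tumor.bam -tumor TUMOR_SAMPLE -L chr10:132716426-132717426 -bamout chr10_debug.bam -O chr10_debug.vcf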

CNNScoreVariants


Hi,
I have tumor-only calls from Mutect and would like to filter them further. My first question is whether CNNScoreVariants is applicable to this. If so, I can't get it to work. The command line I am using (I tried both the 1D and 2D models, and also different batch sizes with the 2D model):
/GenomeAnalysisTk/4.0.5.2/gatk CNNScoreVariants -V /path/to/input_somatic_twicefiltered.vcf.gz -R /ref/hg19/Homo_sapiens_assembly19.fasta -O /path/to/output_annotated.vcf --intra-op-threads 20
However, it always gets stuck at this point, with no progress (for days):
13:00:07.286 INFO CNNScoreVariants - Initializing engine
13:00:14.027 INFO FeatureManager - Using codec VCFCodec to read file
13:00:14.661 INFO CNNScoreVariants - Done initializing engine
I am only providing one sample here. Am I supposed to provide a batch? Or what else could I be doing wrong?
Thanks!
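
Two notes that may help. First, as far as I know CNNScoreVariants was trained on germline, single-sample calls, so tumor-only Mutect output may simply be out of scope (FilterMutectCalls is the intended somatic filter). Second, the 2D read_tensor model also needs the aligned reads via -I. A sketch, assuming GATK 4.0.x flags and hypothetical file names:

gatk CNNScoreVariants -R Homo_sapiens_assembly19.fasta -V input_somatic_twicefiltered.vcf.gz -I tumor.bam --tensor-type read_tensor -O output_annotated.vcf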
