Recent Discussions — GATK-Forum

Split VCF by groups for Genotype Refinement?


Hello!

I have a VCF with 60 samples, corresponding to 10 mice per group. Each group was derived from the same founder population 40 years ago and now exhibits highly divergent characteristics, corresponding to the selection program it was under over that period of time.

The VCF was produced by applying GATK's Best Practices to my read data, calling all samples at once.

Now that VQSR is done, I want to refine the genotypes.

Considering that the groups are phenotypically so different and that I have 10 samples per group, I thought the most reasonable thing to do would be to run CalculateGenotypePosteriors within each group, instead of across the whole cohort (60 samples), as sketched below.
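Concretely, the per-group plan would look something like this (a rough sketch; the group sample lists and file names are placeholders):

    # groupA.args etc. each list that group's 10 sample names, one per line
    # (passing a file of names to --sample-name is assumed to work in this
    #  GATK version; otherwise repeat -sn once per sample)
    for grp in groupA groupB groupC groupD groupE groupF; do
        # subset the post-VQSR cohort VCF to this group's samples
        gatk SelectVariants \
            -R ref.fasta \
            -V cohort.vqsr.vcf.gz \
            --sample-name ${grp}.args \
            -O ${grp}.vcf.gz
        # refine genotypes using only this group's allele frequencies
        gatk CalculateGenotypePosteriors \
            -V ${grp}.vcf.gz \
            -O ${grp}.refined.vcf.gz
    done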

Could you please indicate if my reasoning is flawed and if it would be better to apply CalculateGenotypePosteriors on the whole cohort?

Thanks in advance!


Calling Somatic Variants without matched normals using GATK.


Hi,
I am new to the world of bioinformatics. I currently have sequencing data (WES) from about 45 pediatric brain tumor samples (archived FFPE), and I am keen on identifying the mutational burden and mutational signatures in these samples. I don't necessarily want to discover a novel mutation and describe its biological relevance; rather, I want to use the pattern of mutational signatures to identify the causes of recurrence in these tumors. The problem is that, as with most archived FFPE samples, I don't have matched normal tissue. I am looking for the best approach to call somatic variants in these samples. Is using gnomAD for filtering my best option? Is it a good resource for pediatric tumors? If not, what could be other potential sources for this?

Thank you for your advice.
Aditi

GATK FastaAlternateReferenceMaker not correcting fasta reference

Hi,
I am trying to use GATK FastaAlternateReferenceMaker, but the output FASTA file is the same as the one used as input. In other words, my FASTA genome file is not corrected according to the VCF file provided. I am wondering whether this is a mistake on my part or a bug in the tool.
Here is the command line I used:
$ nice -19 gatk FastaAlternateReferenceMaker -R Dp_PB-MI_190104_dedup.fasta -V Mi_M-B-Dp_PB_B-M-freebayes_onlyindels_cov_qMi+20_SRRF-notrepeat_sorted.vcf -O Dp_PB-MI_190104_dedup_gatkcorrected.fasta &>gatk.log
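Before assuming a bug, I also ran a quick sanity check for contig-name mismatches between the VCF and the FASTA, which I understand could leave the reference effectively untouched (a rough sketch using standard Unix tools):

    # contig names in the reference FASTA
    grep '^>' Dp_PB-MI_190104_dedup.fasta | sed 's/^>//; s/ .*//' | sort -u > ref_contigs.txt
    # contig names that actually carry records in the VCF
    grep -v '^#' Mi_M-B-Dp_PB_B-M-freebayes_onlyindels_cov_qMi+20_SRRF-notrepeat_sorted.vcf \
        | cut -f1 | sort -u > vcf_contigs.txt
    # anything printed here is present in the VCF but missing from the reference
    comm -13 ref_contigs.txt vcf_contigs.txt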

Thanks in advance for your help.

Paul

How to make somatic variant calls from RNA-Seq data (tumor) and whole exome data (matched normal)


I need to make variant calls for a mouse tumor cell line. We have RNA-Seq data from this tumor cell line, but for the matched normal what we have is whole-exome sequencing data. Are there any tools/workflows I can use in this case?
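For reference, what I had in mind is a standard tumor-normal Mutect2 run (a rough sketch; file and sample names are placeholders, and I am assuming the RNA BAM has already been through the RNAseq pre-processing: STAR alignment, duplicate marking, SplitNCigarReads, and BQSR):

    gatk Mutect2 \
        -R mm10.fasta \
        -I tumor_rnaseq.bam \
        -I normal_wes.bam \
        -normal normal_sample_name \
        -O somatic.vcf.gz

I am unsure whether mixing an RNA-derived tumor with a DNA-derived normal is sound, hence the question. Thanks!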

Failed to detect whether we are running on google compute engine ???

Gokalps-Mac-mini:1000GVCFs sky$ gatk SelectVariants -V 1000G_CEU_chr16.vcf.gz -O 1000G_CEU_AFfilt_chr16.vcf.gz -select "AF > 0.0"
Using GATK jar /Users/sky/scripts/gatk-package-4.0.9.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /Users/sky/scripts/gatk-package-4.0.9.0-local.jar SelectVariants -V 1000G_CEU_chr16.vcf.gz -O 1000G_CEU_AFfilt_chr16.vcf.gz -select AF > 0.0
14:35:45.842 INFO  NativeLibraryLoader - Loading libgkl_compression.dylib from jar:file:/Users/sky/scripts/gatk-package-4.0.9.0-local.jar!/com/intel/gkl/native/libgkl_compression.dylib
Sep 24, 2018 2:35:47 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
WARNING: Failed to detect whether we are running on Google Compute Engine.
java.net.ConnectException: No route to host (connect failed)
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:589)
    at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:463)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:558)
    at sun.net.www.http.HttpClient.<init>(HttpClient.java:242)
    at sun.net.www.http.HttpClient.New(HttpClient.java:339)
    at sun.net.www.http.HttpClient.New(HttpClient.java:357)
    at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1220)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1156)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1050)
    at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:984)
    at shaded.cloud_nio.com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:104)
    at shaded.cloud_nio.com.google.api.client.http.HttpRequest.execute(HttpRequest.java:981)
    at shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials.runningOnComputeEngine(ComputeEngineCredentials.java:210)
    at shaded.cloud_nio.com.google.auth.oauth2.DefaultCredentialsProvider.tryGetComputeCredentials(DefaultCredentialsProvider.java:290)
    at shaded.cloud_nio.com.google.auth.oauth2.DefaultCredentialsProvider.getDefaultCredentialsUnsynchronized(DefaultCredentialsProvider.java:207)
    at shaded.cloud_nio.com.google.auth.oauth2.DefaultCredentialsProvider.getDefaultCredentials(DefaultCredentialsProvider.java:124)
    at shaded.cloud_nio.com.google.auth.oauth2.GoogleCredentials.getApplicationDefault(GoogleCredentials.java:127)
    at shaded.cloud_nio.com.google.auth.oauth2.GoogleCredentials.getApplicationDefault(GoogleCredentials.java:100)
    at com.google.cloud.ServiceOptions.defaultCredentials(ServiceOptions.java:304)
    at com.google.cloud.ServiceOptions.<init>(ServiceOptions.java:278)
    at com.google.cloud.storage.StorageOptions.<init>(StorageOptions.java:83)
    at com.google.cloud.storage.StorageOptions.<init>(StorageOptions.java:31)
    at com.google.cloud.storage.StorageOptions$Builder.build(StorageOptions.java:78)
    at org.broadinstitute.hellbender.utils.gcs.BucketUtils.setGlobalNIODefaultOptions(BucketUtils.java:360)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:183)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
    at org.broadinstitute.hellbender.Main.main(Main.java:289)

14:35:47.079 INFO  SelectVariants - ------------------------------------------------------------
14:35:47.079 INFO  SelectVariants - The Genome Analysis Toolkit (GATK) v4.0.9.0
14:35:47.080 INFO  SelectVariants - For support and documentation go to https://software.broadinstitute.org/gatk/
14:35:47.080 INFO  SelectVariants - Executing as sky@Gokalps-Mac-mini.local on Mac OS X v10.13.6 x86_64
14:35:47.080 INFO  SelectVariants - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_181-b13
14:35:47.080 INFO  SelectVariants - Start Date/Time: September 24, 2018 2:35:45 PM EET
14:35:47.080 INFO  SelectVariants - ------------------------------------------------------------
14:35:47.081 INFO  SelectVariants - ------------------------------------------------------------
14:35:47.082 INFO  SelectVariants - HTSJDK Version: 2.16.1
14:35:47.082 INFO  SelectVariants - Picard Version: 2.18.13
14:35:47.082 INFO  SelectVariants - HTSJDK Defaults.COMPRESSION_LEVEL : 2
14:35:47.082 INFO  SelectVariants - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
14:35:47.082 INFO  SelectVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
14:35:47.082 INFO  SelectVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
14:35:47.082 INFO  SelectVariants - Deflater: IntelDeflater
14:35:47.082 INFO  SelectVariants - Inflater: IntelInflater

I get this strange error message from time to time. I am clearly using the local GATK jar, so I guess something is going wrong in the Google Compute Engine detection check.

Will mutect2 report multiallelic sites for indels?


Is Mutect2 capable of identifying multiallelic indels, and if so, how would they be represented in the VCF file?

many thanks

Enquiry about joint genotyping in a family study

Dear all,
I have a family of four, two of whom have a certain phenotype. WES was performed to look for a genotype/phenotype correlation. However, the WES capture panel was different for one of them (i.e., three had the same capture BED and one was different, though it was still WES).
My question is: can I do joint genotyping for all 4 samples (sketched below)? Or should I do individual haplotype calling instead? Or should I do joint genotyping for the 3 samples that had the same BED and individual calling for the remaining sample?
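For context, the all-four joint route I am considering looks roughly like this (a sketch; interval files and sample names are placeholders):

    # per-sample GVCFs
    for s in sample1 sample2 sample3 sample4; do
        gatk HaplotypeCaller -R ref.fasta -I ${s}.bam -O ${s}.g.vcf.gz -ERC GVCF
    done
    # combine, then joint-genotype over the intervals shared by both capture designs
    gatk CombineGVCFs -R ref.fasta \
        -V sample1.g.vcf.gz -V sample2.g.vcf.gz -V sample3.g.vcf.gz -V sample4.g.vcf.gz \
        -O family.g.vcf.gz
    gatk GenotypeGVCFs -R ref.fasta -V family.g.vcf.gz \
        -L capture_intersection.bed -O family.vcf.gz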

Thank you very much in advance!
Best regards,
Nelson

Is there any difference between the HaplotypeCaller in GATK3 and GATK4?

I found some differences when comparing the results produced by GATK3 and GATK4. Both used HaplotypeCaller, and GATK3 called more variants than GATK4. I then ran GATK3 HaplotypeCaller again to check whether there are differences between batches, but the two GATK3 results were identical. I am confused by this behavior.

GATK4 HaplotypeCaller (GVCF mode) gives a wrong result, possibly a bug, when the sequence is a simple repeat


Hi all,
When I use GATK4 HaplotypeCaller (GVCF mode), I get a wrong result, and after checking I believe it may be a bug.
The result is as below:

X 66765227 . A AGC,&lt;NON_REF&gt; 176.73 . BaseQRankSum=2.602;ClippingRankSum=0.000;DP=222;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=-3.641;RAW_MQ=790804.00;ReadPosRankSum=-6.644 GT:AD:DP:GQ:PGT:PID:PL:SB 0/1:173,27,0:200:99:0|1:66765227_A_AGC:214,0,4550,723,4632,5354:98,75,12,15
X 66765228 . A AGCAGCAC,&lt;NON_REF&gt; 176.73 . BaseQRankSum=3.280;ClippingRankSum=0.000;DP=223;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=-3.653;RAW_MQ=794404.00;ReadPosRankSum=-7.023 GT:AD:DP:GQ:PGT:PID:PL:SB 0/1:174,27,0:201:99:0|1:66765227_A_AGC:214,0,4550,723,4632,5355:99,75,12,15

The result is a het call. I then checked the BAM files in IGV: the calling bamout (Fig. 1) and the markdup BAM (Fig. 2).

[Fig. 1: IGV view of the HaplotypeCaller bamout]

[Fig. 2: IGV view of the markdup BAM]

I then fetched the sequence of the read named A00204:300:HJCT5DSXX:3:1526:17580:13839, which appears in Figs. 1 and 2:

[Fig. 3: the read A00204:300:HJCT5DSXX:3:1526:17580:13839]

I found that in the calling bamout file, the read's sequence was cut down in this region: ATCCAGAACCCGGGCCCCAGGCACCCAGAGGCCGCGAGCGCAGCACCTCCCGGCGCCAGTTTGCTGCTGCTGC.
So, when mapped to the reference sequence, the rest of the read shifts 9 bp to the right.

When I use GATK3, the result is correct.

gatk3 cmd: java -Xms1g -Xmx30g -jar /home/zgong/GATK-3.7/GenomeAnalysisTK.jar -T HaplotypeCaller --read_filter BadCigar --read_filter NotPrimaryAlignment -R /bioinfo/data/iGenomes/Homo_sapiens/NCBI/build37.2/Sequence/WholeGenomeFasta/genome.fa --dbsnp /bioinfo/data/SNP/SNP149_GRCh37.vcf --output_mode EMIT_ALL_SITES -L /home/xzheng/data/exon/hg19_exon_v7.bed -I GW9B0025A03.markdup.bam -o test_gatk3
gatk4 cmd: /home/xzheng/software/GATK4/gatk-4.0.1.1/gatk --java-options -Xmx20g HaplotypeCaller -R /bioinfo/data/iGenomes/Homo_sapiens/NCBI/build37.2/Sequence/WholeGenomeFasta/genome.fa -ERC GVCF --genotyping-mode DISCOVERY -I GW9B0025A03.markdup.bam -O test.g.vcf.gz --read-filter PrimaryLineReadFilter --max-reads-per-alignment-start 0 --kmer-size 10 --kmer-size 15 --kmer-size 25 --dbsnp /bioinfo/data/SNP/SNP149_GRCh37.vcf -L /home/xzheng/data/exon/hg19_exon_v7.bed --native-pair-hmm-threads 20 -bamout out.bam


Installation - Picard


Hi all,

I have downloaded Picard and htsjdk from GitHub as indicated in the documentation. However, when I go to the Picard root and type "ant", I get the following error:

compile-samtools:
[javac] picard/build.xml:537: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds
[javac] Compiling 407 source files to picard/htsjdk/classes
[javac] javac: invalid target release: 1.8
[javac] Usage: javac

[javac] use -help for a list of possible options

BUILD FAILED
picard/build.xml:139: The following error occurred while executing this line:
picard/htsjdk/build.xml:96: The following error occurred while executing this line:
picard/build.xml:537: Compile failed; see the compiler error output for details.

I have tried setting JAVA6_HOME to /usr/lib/jvm/java-7-openjdk-amd64/lib/ and to /usr/lib/jvm/java-6-openjdk-amd64/lib/. The same error persists.
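If I read the message correctly, "javac: invalid target release: 1.8" means the javac being invoked predates Java 8, so presumably the fix is to build with a JDK 8 rather than 6 or 7 (the path below is a placeholder for wherever JDK 8 lives on the system):

    # point the build at a Java 8 JDK and rebuild
    export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
    export PATH="$JAVA_HOME/bin:$PATH"
    cd picard && ant clean && ant

Does that sound right, or is a specific ant property expected instead?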

Any help is appreciated.
Best regards,
Thiago

Picard CreateSequenceDictionary shows ERROR: Option 'REFERENCE' is required.

I am running Picard on my university cluster, which has picard/2.9.2 installed.

picard CreateSequenceDictionary \
R=newref_495.fa \
O=reference_495.dict

This shows the error:
ERROR: Option 'REFERENCE' is required.

I have already provided the REFERENCE option as R=. How can I solve this?
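In case the cluster's picard wrapper is not forwarding the backslash-continued lines (which would explain Picard never seeing R=), one thing worth trying is the same command on a single line, or bypassing the wrapper (the picard.jar path is a placeholder):

    # single line, Picard 2.9.2 R=/O= syntax
    picard CreateSequenceDictionary R=newref_495.fa O=reference_495.dict

    # or call the jar directly
    java -jar /path/to/picard.jar CreateSequenceDictionary R=newref_495.fa O=reference_495.dict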

GenomicsDBImport terminates after Overlapping contigs found error


My original query was about batching and making intervals for GenomicsDBImport, but I have run into a new problem. I am using version 4.0.7.0. I tried the following:

gatk GenomicsDBImport \
--java-options "-Xmx250G -XX:+UseParallelGC -XX:ParallelGCThreads=24" \
-V input.list \
--genomicsdb-workspace-path 5sp_45ind_assmb_00 \
--intervals interval.00.list \
--batch-size 9 

where I have split my list of contigs into 50 lists and set the batch size to 9 (instead of reading in all 45 g.vcf files at once) for a total of 5 batches. It looked like it had started to run, but it terminated quickly with an error.

The resulting stack trace is:

00:53:23.869 INFO  GenomicsDBImport - HTSJDK Version: 2.16.0
00:53:23.869 INFO  GenomicsDBImport - Picard Version: 2.18.7
00:53:23.869 INFO  GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 2
00:53:23.869 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
00:53:23.869 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
00:53:23.869 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
00:53:23.869 INFO  GenomicsDBImport - Deflater: IntelDeflater
00:53:23.869 INFO  GenomicsDBImport - Inflater: IntelInflater
00:53:23.869 INFO  GenomicsDBImport - GCS max retries/reopens: 20
00:53:23.869 INFO  GenomicsDBImport - Using google-cloud-java patch 6d11bef1c81f885c26b2b56c8616b7a705171e4f from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
00:53:23.869 INFO  GenomicsDBImport - Initializing engine
01:26:13.490 INFO  IntervalArgumentCollection - Processing 58057410 bp from intervals
01:26:13.517 INFO  GenomicsDBImport - Done initializing engine
Created workspace /home/leq/gvcfs/5sp_45ind_assmb_00
01:26:13.655 INFO  GenomicsDBImport - Vid Map JSON file will be written to 5sp_45ind_assmb_00/vidmap.json
01:26:13.655 INFO  GenomicsDBImport - Callset Map JSON file will be written to 5sp_45ind_assmb_00/callset.json
01:26:13.655 INFO  GenomicsDBImport - Complete VCF Header will be written to 5sp_45ind_assmb_00/vcfheader.vcf
01:26:13.655 INFO  GenomicsDBImport - Importing to array - 5sp_45ind_assmb_00/genomicsdb_array
01:26:13.656 INFO  ProgressMeter - Starting traversal
01:26:13.656 INFO  ProgressMeter -        Current Locus  Elapsed Minutes     Batches Processed   Batches/Minute
01:33:16.970 INFO  GenomicsDBImport - Importing batch 1 with 9 samples
[libprotobuf ERROR google/protobuf/io/coded_stream.cc:207] A protocol message was rejected because it was too big (more than 67108864 bytes).  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
Contig/chromosome ctg7180018354961 begins at TileDB column 0 and intersects with contig/chromosome ctg7180018354960 that spans columns [1380207667, 1380207970] terminate called after throwing an instance of 'ProtoBufBasedVidMapperException' what():  
ProtoBufBasedVidMapperException : Overlapping contigs found

How do I overcome this 'overlapping contigs found' issue? Is there a problem with my set of contigs? Also, is the warning about protocol message size something to worry about?
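One check I can run on my end, on the assumption that GenomicsDBImport computes its TileDB column offsets from the contig lengths declared in the GVCF headers: verify that every input declares identical contigs (a rough sketch):

    # each input GVCF should yield the same checksum over its ##contig lines;
    # differing checksums would mean the headers disagree about contigs/lengths
    for f in $(cat input.list); do
        zgrep -h '^##contig' "$f" | md5sum
    done | sort -u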

Thank you!

Running Haplotype Caller on Non-Model Organisms

Hey GATK team,
I appreciate any help with this problem. I'm currently trying to call SNPs on a set of BAM files from a population of individuals. Each BAM file has been preprocessed and aligned to a draft genome that we have in our lab. However, that draft genome has a fairly large number of scaffolds (~3k), and I have 20 individuals that I am trying to combine for this particular analysis.

I've spent some time reading docs and forum posts, and it seems the current recommendation is to use HaplotypeCaller (GATK/4.0.7.0) to create GVCFs for each individual separately and then merge them. This operation cannot use multiple CPU cores, except when a scatter-gather strategy is used.

Here's what I've done to call these 20 bam files individually; each job (Sun Grid Engine parallelized) has 1 compute core and 8GB of RAM:

```
# collect the 20 input BAMs into a bash array
lsi=($(find [bam path] -name '*bam'))

# SGE task IDs start at 1, but bash arrays are 0-indexed
i=$((SGE_TASK_ID - 1))

gatk HaplotypeCaller \
-R [genome path] \
-I ${lsi[i]} \
-O ${lsi[i]%.*}.g.vcf.gz \
-ERC GVCF \
--verbosity ERROR
```

However, this has been running for about 40 hours now and is nowhere near completion. I'm wondering if you have suggestions for how to speed this up? I've seen forum posts suggesting running individual chromosomes to save computation time per instance, but in my case this would mean breaking the analysis into 60,000 chunks (one per scaffold, times 20 samples), which seems unwieldy. I've also seen some information about the Spark tools, but when I try to use them on my cluster I get a warning saying that the Spark implementation isn't complete and might make spurious calls.
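To make the scatter idea concrete, what I am considering is grouping the ~3k scaffolds into a moderate number of interval lists and running one job per (sample, chunk) pair (a sketch; GNU split is assumed):

    # one scaffold name per line
    grep '^>' genome.fasta | sed 's/^>//; s/ .*//' > scaffolds.txt
    # group into 100 interval files: chunk_00.intervals ... chunk_99.intervals
    split -n l/100 -d --additional-suffix=.intervals scaffolds.txt chunk_

    # one job per chunk per sample
    gatk HaplotypeCaller \
        -R genome.fasta \
        -I sample.bam \
        -L chunk_${CHUNK_ID}.intervals \
        -ERC GVCF \
        -O sample.chunk_${CHUNK_ID}.g.vcf.gz

Picard GatherVcfs (or MergeVcfs) could presumably then stitch the per-chunk GVCFs back into one file per sample.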

Do you have any advice on how to proceed for this analysis?

Thanks,
Evan

Out-of-order read after MarkDuplicatesSpark + BaseRecalibrator/ApplyBQSR


Hi,

I am building a workflow for discovery of somatic SNVs + indels that is pretty much the Broad's Best Practices, but incorporating MarkDuplicatesSpark and a couple of other minor changes. Today I was running a normal-tumor pair of WES samples on GCP, and everything was going great until the workflow failed during Mutect2. In one of the shards (I am scattering the M2 step across 12 splits of the exome BED file) I got this error:

    13:53:46.994 INFO  ProgressMeter -       chr19:18926479             20.2                 22440           1112.1
    13:53:51.138 INFO  VectorLoglessPairHMM - Time spent in setup for JNI call : 0.589863008
    13:53:51.145 INFO  PairHMM - Total compute time in PairHMM computeLogLikelihoods() : 415.78724766500005
    13:53:51.147 INFO  SmithWatermanAligner - Total compute time in java Smith-Waterman : 82.56 sec
    13:53:52.161 INFO  Mutect2 - Shutting down engine
    [February 19, 2019 1:53:52 PM UTC] org.broadinstitute.hellbender.tools.walkers.mutect.Mutect2 done. Elapsed time: 20.68 minutes.
    Runtime.totalMemory()=1132453888
    java.lang.IllegalArgumentException: Attempting to add a read to ActiveRegion out of order w.r.t. other reads: lastRead SRR3270880.37535587 chr19:19227104-19227253 at 19227104 attempting to add SRR3270880.23592400 chr19:19226999-19227148 at 19226999
        at org.broadinstitute.hellbender.utils.Utils.validateArg(Utils.java:730)
        at org.broadinstitute.hellbender.engine.AssemblyRegion.add(AssemblyRegion.java:338)
        at org.broadinstitute.hellbender.engine.AssemblyRegionIterator.fillNextAssemblyRegionWithReads(AssemblyRegionIterator.java:230)
        at org.broadinstitute.hellbender.engine.AssemblyRegionIterator.loadNextAssemblyRegion(AssemblyRegionIterator.java:194)
        at org.broadinstitute.hellbender.engine.AssemblyRegionIterator.next(AssemblyRegionIterator.java:135)
        at org.broadinstitute.hellbender.engine.AssemblyRegionIterator.next(AssemblyRegionIterator.java:34)
        at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.processReadShard(AssemblyRegionWalker.java:286)
        at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.traverse(AssemblyRegionWalker.java:267)
        at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:966)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:138)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:162)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:205)
        at org.broadinstitute.hellbender.Main.main(Main.java:291)
    Using GATK jar /gatk/gatk-package-4.1.0.0-local.jar

The other 11 shards finished without errors and produced the expected output.

I checked the BAM from the tumor sample, and indeed the read mentioned in the error is out of order. It is the second read from the end in the following snippet (pasting here only the first 9 columns of the BAM file):

    SRR3270880.37535587 163 chr19   19227104    60  150M    =   19227395    441
    SRR3270880.46694860 147 chr19   19227106    60  150M    =   19226772    -484
    SRR3270880.60287639 1171    chr19   19227106    60  150M    =   19226772    -484
    SRR3270880.68448188 83  chr19   19227106    60  150M    =   19226611    -645
    SRR3270880.70212050 1171    chr19   19227106    60  150M    =   19226772    -484
    SRR3270880.23592400 163 chr19   19226999    60  150M    =   19227232    383
    SRR3270880.21876644 1171    chr19   19227001    60  150M    =   19226793    -358

The read does not have any bad quality flags, and it appears twice in the BAM, being in the correct order at its first occurrence (second read in the following snippet):

    SRR3270880.61849825 147 chr19   19226995    60  150M    =   19226895    -250
    SRR3270880.23592400 163 chr19   19226999    60  150M    =   19227232    383
    SRR3270880.21876644 1171    chr19   19227001    60  150M    =   19226793    -358
    SRR3270880.47062210 147 chr19   19227001    60  150M    =   19226625    -526

The workflow does not include SortSam after MarkDuplicatesSpark, as MarkDuplicatesSpark's output is supposed to be coordinate-sorted. From the BAM's header: @HD VN:1.6 GO:none SO:coordinate

Prior to Mutect2, BaseRecalibrator-GatherBqsrReport-ApplyBQSR-GatherBamFiles (non-Spark versions) finished without any errors. These steps are also scattered across interval splits of the exome BED file.

Strikingly, the start and end positions of this out-of-order read span from the last interval of split 6 to the first interval of split 7. Maybe the read was included in two contiguous splits of the BAM file at the same time, and that is why it appears twice in the BAM after the merge done by GatherBamFiles. (Last interval of split 6: chr19 19226311 19227116; first interval of split 7: chr19 19227145 19228774.)

Intervals in my workflow are split with the SplitIntervals tool (GATK 4.1.0.0). I am currently using --subdivision-mode BALANCING_WITHOUT_INTERVAL_SUBDIVISION and suspect this could be related to the error...
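As a stopgap while this is investigated, I am considering re-sorting the gathered BAM so that strict coordinate order is restored before Mutect2, even if a read was duplicated across splits (a sketch; the picard.jar path is a placeholder):

    # restore strict coordinate order in the gathered BAM
    java -jar /path/to/picard.jar SortSam \
        I=gathered.bam \
        O=gathered.sorted.bam \
        SORT_ORDER=coordinate \
        CREATE_INDEX=true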

Any ideas on how this issue can be solved?

Thank you in advance


Mutect2 allele specific stats for Multiallelic sites

Hello,

I want allele-specific stats for multiallelic sites. I was able to get this information from other somatic variant callers, but I couldn't get it from either Mutect2 or MuTect1.

If you could suggest one possible way, with either MuTect (which I believe is no longer supported) or Mutect2, I would appreciate it. A workaround I am considering is below.
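The workaround would be to split multiallelic records into biallelic ones after calling, so the per-allele AD values and allele-specific annotations become easier to consume (a sketch; I am assuming GATK4's LeftAlignAndTrimVariants with its --split-multi-allelics flag):

    gatk LeftAlignAndTrimVariants \
        -R ref.fasta \
        -V mutect2_calls.vcf.gz \
        --split-multi-allelics \
        -O mutect2_calls.split.vcf.gz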

Thanks in advance, Gufran

HaplotypeCaller VCF entries with 0/0 genotype. How to interpret?


Hi there,
I'm running HaplotypeCaller version 4.0.10.0 using your Cromwell pipeline, which puts this command information in the VCF header:

    HaplotypeCaller --contamination-fraction-to-filter 0.0
      --output GM12878_DNA_dup.bam.p4.bam.dup.bam_fixed.vcf.gz
      --intervals /projects/rcorbettprj2/clinicalGermline/GIABtests/bothReplicates/merges/variant_call_wdl/cromwell-executions/run_haplotypecaller_on_directory/259f5b72-4367-4dd1-a1b0-a26c3ddb2fdb/call-HaplotypeCallerGvcf_GATK4/shard-4/haplotypecaller.HaplotypeCallerGvcf_GATK4/953f9567-2161-4f45-a3be-2b19271d5f89/call-HaplotypeCaller/shard-35/inputs/1202757533/0035-scattered.intervals
      --input /projects/rcorbettprj2/clinicalGermline/GIABtests/bothReplicates/merges/variant_call_wdl/cromwell-executions/run_haplotypecaller_on_directory/259f5b72-4367-4dd1-a1b0-a26c3ddb2fdb/call-HaplotypeCallerGvcf_GATK4/shard-4/haplotypecaller.HaplotypeCallerGvcf_GATK4/953f9567-2161-4f45-a3be-2b19271d5f89/call-HaplotypeCaller/shard-35/inputs/-1421299472/GM12878_DNA_dup.bam.p4.bam.dup.bam_fixed.bam
      --reference /projects/rcorbettprj2/clinicalGermline/GIABtests/bothReplicates/merges/variant_call_wdl/cromwell-executions/run_haplotypecaller_on_directory/259f5b72-4367-4dd1-a1b0-a26c3ddb2fdb/call-HaplotypeCallerGvcf_GATK4/shard-4/haplotypecaller.HaplotypeCallerGvcf_GATK4/953f9567-2161-4f45-a3be-2b19271d5f89/call-HaplotypeCaller/shard-35/inputs/-1226353055/hg19a.fa
      --emit-ref-confidence NONE
      --gvcf-gq-bands 1 --gvcf-gq-bands 2 --gvcf-gq-bands 3 --gvcf-gq-bands 4 --gvcf-gq-bands 5 --gvcf-gq-bands 6
      --gvcf-gq-bands 7 --gvcf-gq-bands 8 --gvcf-gq-bands 9 --gvcf-gq-bands 10 --gvcf-gq-bands 11 --gvcf-gq-bands 12
      --gvcf-gq-bands 13 --gvcf-gq-bands 14 --gvcf-gq-bands 15 --gvcf-gq-bands 16 --gvcf-gq-bands 17 --gvcf-gq-bands 18
      --gvcf-gq-bands 19 --gvcf-gq-bands 20 --gvcf-gq-bands 21 --gvcf-gq-bands 22 --gvcf-gq-bands 23 --gvcf-gq-bands 24
      --gvcf-gq-bands 25 --gvcf-gq-bands 26 --gvcf-gq-bands 27 --gvcf-gq-bands 28 --gvcf-gq-bands 29 --gvcf-gq-bands 30
      --gvcf-gq-bands 31 --gvcf-gq-bands 32 --gvcf-gq-bands 33 --gvcf-gq-bands 34 --gvcf-gq-bands 35 --gvcf-gq-bands 36
      --gvcf-gq-bands 37 --gvcf-gq-bands 38 --gvcf-gq-bands 39 --gvcf-gq-bands 40 --gvcf-gq-bands 41 --gvcf-gq-bands 42
      --gvcf-gq-bands 43 --gvcf-gq-bands 44 --gvcf-gq-bands 45 --gvcf-gq-bands 46 --gvcf-gq-bands 47 --gvcf-gq-bands 48
      --gvcf-gq-bands 49 --gvcf-gq-bands 50 --gvcf-gq-bands 51 --gvcf-gq-bands 52 --gvcf-gq-bands 53 --gvcf-gq-bands 54
      --gvcf-gq-bands 55 --gvcf-gq-bands 56 --gvcf-gq-bands 57 --gvcf-gq-bands 58 --gvcf-gq-bands 59 --gvcf-gq-bands 60
      --gvcf-gq-bands 70 --gvcf-gq-bands 80 --gvcf-gq-bands 90 --gvcf-gq-bands 99
      --indel-size-to-eliminate-in-ref-model 10 --use-alleles-trigger false --disable-optimizations false
      --just-determine-active-regions false --dont-genotype false --max-mnp-distance 0 --dont-trim-active-regions false
      --max-disc-ar-extension 25 --max-gga-ar-extension 300 --padding-around-indels 150 --padding-around-snps 20
      --kmer-size 10 --kmer-size 25 --dont-increase-kmer-sizes-for-cycles false --allow-non-unique-kmers-in-ref false
      --num-pruning-samples 1 --recover-dangling-heads false --do-not-recover-dangling-branches false
      --min-dangling-branch-length 4 --consensus false --max-num-haplotypes-in-population 128 --error-correct-kmers false
      --min-pruning 2 --debug-graph-transformations false --kmer-length-for-read-error-correction 25
      --min-observations-for-kmer-to-be-solid 20 --likelihood-calculation-engine PairHMM
      --base-quality-score-threshold 18 --pair-hmm-gap-continuation-penalty 10 --pair-hmm-implementation FASTEST_AVAILABLE
      --pcr-indel-model CONSERVATIVE --phred-scaled-global-read-mismapping-rate 45 --native-pair-hmm-threads 4
      --native-pair-hmm-use-double-precision false --debug false --use-filtered-reads-for-annotations false
      --bam-writer-type CALLED_HAPLOTYPES --dont-use-soft-clipped-bases false --capture-assembly-failure-bam false
      --error-correct-reads false --do-not-run-physical-phasing false --min-base-quality-score 10 --smith-waterman JAVA
      --use-new-qual-calculator false --annotate-with-num-discovered-alleles false --heterozygosity 0.001
      --indel-heterozygosity 1.25E-4 --heterozygosity-stdev 0.01 --standard-min-confidence-threshold-for-calling 10.0
      --max-alternate-alleles 6 --max-genotype-count 1024 --sample-ploidy 2 --num-reference-samples-if-no-call 0
      --genotyping-mode DISCOVERY --genotype-filtered-alleles false --output-mode EMIT_VARIANTS_ONLY --all-site-pls false
      --min-assembly-region-size 50 --max-assembly-region-size 300 --assembly-region-padding 100
      --max-reads-per-alignment-start 50 --active-probability-threshold 0.002 --max-prob-propagation-distance 50
      --interval-set-rule UNION --interval-padding 0 --interval-exclusion-padding 0 --interval-merging-rule ALL
      --read-validation-stringency SILENT --seconds-between-progress-updates 10.0
      --disable-sequence-dictionary-validation false --create-output-bam-index true --create-output-bam-md5 false
      --create-output-variant-index true --create-output-variant-md5 false --lenient false
      --add-output-sam-program-record true --add-output-vcf-command-line true --cloud-prefetch-buffer 40
      --cloud-index-prefetch-buffer -1 --disable-bam-index-caching false --sites-only-vcf-output false --help false
      --version false --showHidden false --verbosity INFO --QUIET false --use-jdk-deflater false --use-jdk-inflater false
      --gcs-max-retries 20 --gcs-project-for-requester-pays --disable-tool-default-read-filters false
      --minimum-mapping-quality 20 --disable-tool-default-annotations false --enable-all-annotations false",Version=4.0.10.0,Date="March 6, 2019 5:54:45 PM PST

In the resulting VCF I am seeing a number of records like the following:

1 1453665 . T . 20.80 . AN=2;DP=11;MQ=55.12 GT:AD:DP 0/0:11:11
1 1453666 . A . 20.80 . AN=2;DP=11;MQ=55.12 GT:AD:DP 0/0:11:11
1 1453676 . A . 14.91 . AN=2;DP=13;MQ=55.90 GT:AD:DP 0/0:13:13
1 1453682 . A . 14.91 . AN=2;DP=13;MQ=55.90 GT:AD:DP 0/0:13:13
1 1453696 . A . 12.05 . AN=2;DP=13;MQ=55.90 GT:AD:DP 0/0:13:13
1 200783760 . AAAC . 384.73 . AN=2;DP=11;MQ=60.00 GT:AD:DP 0/0:11:11

My initial interpretation of these is that although my VCF file should contain only sites that GATK believes to be real variants (i.e., different from the reference), some records have been included and then marked as homozygous-reference by some part of the pipeline.

My concern is that casually developed tools will see that these records are not filtered and happily use the locations even though the GT field suggests otherwise. Do you suggest running a second command to filter these records out (sketched below), or do most tools correctly filter them out at runtime? Do we know what SnpEff or ANNOVAR will do with these records?
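For what it is worth, the second-command route I have in mind would be something like this (a sketch using SelectVariants; I am assuming these flags behave this way in 4.0.10.0):

    # drop records where the genotype carries no alternate allele
    gatk SelectVariants \
        -R hg19a.fa \
        -V GM12878_DNA_dup.bam.p4.bam.dup.bam_fixed.vcf.gz \
        --exclude-non-variants \
        --remove-unused-alternates \
        -O GM12878_variants_only.vcf.gz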

thanks,
Richard

GetBayesianHetCoverage HeterogeneousHeterozygousPileupPriorModel

The GATK Best Practices for variant calling on RNAseq, in full detail


We’re excited to introduce our Best Practices recommendations for calling variants on RNAseq data. These recommendations are based on our classic DNA-focused Best Practices, with some key differences in the early data processing steps, as well as in the calling step.


Best Practices workflow for RNAseq


This workflow is intended to be run per-sample; joint calling on RNAseq is not supported yet, though that is on our roadmap.

Please see the new document here for full details about how to run this workflow in practice.

In brief, the key modifications made to the DNAseq Best Practices focus on handling splice junctions correctly, which involves specific mapping and pre-processing procedures, as well as some new functionality in the HaplotypeCaller.
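For orientation, the centerpiece of that pre-processing is SplitNCigarReads, which splits reads containing N operators in their CIGAR strings (spliced reads) into exon segments and reassigns STAR's 255 mapping qualities to 60; see the document linked above for the authoritative command, which is roughly (GATK3-era syntax):

    java -jar GenomeAnalysisTK.jar -T SplitNCigarReads \
        -R ref.fasta -I dedupped.bam -o split.bam \
        -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 \
        -U ALLOW_N_CIGAR_READS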

Now, before you try to run this on your data, there are a few important caveats that you need to keep in mind.

Please keep in mind that our DNA-focused Best Practices were developed over several years of thorough experimentation, and are continuously updated as new observations come to light and the analysis methods improve. We have only been working with RNAseq for a few months, so there are many aspects that we still need to examine in more detail before we can be fully confident that we are doing the best possible thing.

For one thing, these recommendations are based on high quality RNA-seq data (30 million 75bp paired-end reads produced on Illumina HiSeq). Other types of data might need slightly different processing. In addition, we have currently worked only on data from one tissue from one individual. Once we’ve had the opportunity to get more experience with different types (and larger amounts) of data, we will update these recommendations to be more comprehensive.

Finally, we know that the current recommended pipeline produces both false positive (wrong variant call) and false negative (missed variant) errors. While some of these errors are inevitable in any pipeline, others are errors that we can and will address in future versions of the pipeline. A few examples of such errors are given in this article, along with our ideas for fixing them in the future.

We will be improving these recommendations progressively as we go, and we hope that the research community will help us by providing feedback of their experiences applying our recommendations to their data. We look forward to hearing your thoughts and observations!

How are the unmapped.bam files created?


I am following the GATK Best Practices (the $5 pipeline, broad-prod-wgs-germline). Could someone please tell me:
1. What kind of "source" files (BCL/FASTQ/BAM) are used to create the unmapped.bam files? (My current guess is sketched below.)
2. If possible, where can I find these "source" files?
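My current guess, which I would like confirmed, is that the uBAMs are produced from FASTQs with Picard FastqToSam, along these lines (the read-group fields are examples):

    java -jar picard.jar FastqToSam \
        FASTQ=sample_R1.fastq.gz \
        FASTQ2=sample_R2.fastq.gz \
        OUTPUT=sample.unmapped.bam \
        SAMPLE_NAME=sample1 \
        READ_GROUP_NAME=rg1 \
        LIBRARY_NAME=lib1 \
        PLATFORM=illumina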
