Channel: Recent Discussions — GATK-Forum

GenotypeGVCF possible on just one chromosome?


I have a batch of g.vcf files that I want to merge into one common vcf, but I am running into an error message (one of those "please do NOT post" errors: "the provided VCF file is malformed at approximately line number..."). I am not able to find the error. (This is the standard "best practice" GenotypeGVCFs script that has worked a number of times on smaller groups of files.) When there are problems, the program does not always tell me which file is at fault, but this time it indicates that the problem is on X, which is analyzed at the end of the run. This means I have to wait about a week to get the error message. (I am using GATK 3.8 and Java jdk1.8.0_112.)

So I have been testing smaller groups of files to find the problem file, but it would be much faster to check only the most likely chromosome. Is that possible?
Would making a vcf from a single g.vcf detect all the same errors as genotyping groups of files?
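
Concretely, would something like this be supported (a sketch in GATK 3.8 syntax; file names are placeholders)?

java -jar GenomeAnalysisTK.jar \
    -T GenotypeGVCFs \
    -R human_ref.fasta \
    -V sample1.g.vcf \
    -V sample2.g.vcf \
    -L X \
    -o genotyped.chrX.vcf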


GVCF blocks

Hi gatkers!

I'm currently working with gvcf files, and there are some aspects I don't fully understand.
- In a non-variant block, do all the positions from POS to END have the same REF allele? For example, in a block of thousands of positions with an A as the REF allele, are there really thousands of adenines?
- In the gvcf file, will I always have all genome positions, or could I restrict it to a bed file / intervals of interest (e.g. something like the sketch below)?
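
For the second question, restricting an existing gvcf to intervals is what I have in mind (a sketch in GATK4 syntax; file names are placeholders):

gatk SelectVariants \
    -R ref.fasta \
    -V sample.g.vcf.gz \
    -L intervals_of_interest.bed \
    -O sample.subset.g.vcf.gz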

Any help is very welcome.
Thank you very much!

Marta.

GenomicsDBImport too slow!!!

Dear all,
I'm running GenomicsDBImport for 30 samples. It takes so much time, and after 3 days the job was killed for exceeding the walltime limit. I want to ask you if there is a way to make it faster.
I thought of running a separate job for each chromosome... is it possible? In that case, should GenomicsDBImport create a different workspace folder for each chromosome? And at that point, can GenotypeGVCFs open all the per-chromosome folders, or should I merge them?
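
What I have in mind is roughly this (a sketch; GenomicsDBImport, GenotypeGVCFs and MergeVcfs are real GATK4 tools, but paths and chromosome names are placeholders):

for CHR in chr1 chr2 chr3; do
    gatk GenomicsDBImport \
        --genomicsdb-workspace-path db_${CHR} \
        --sample-name-map samples.sample_map \
        -L ${CHR}
    gatk GenotypeGVCFs \
        -R ref.fasta \
        -V gendb://db_${CHR} \
        -O genotyped.${CHR}.vcf.gz
done
# each chromosome gets its own workspace; the per-chromosome VCFs are merged afterwards
gatk MergeVcfs \
    -I genotyped.chr1.vcf.gz -I genotyped.chr2.vcf.gz -I genotyped.chr3.vcf.gz \
    -O genotyped.all.vcf.gz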
Thanks!!!
Alex


GBS data and duplicates marking


Hi, I just read this in the Best Practices guide:

"Duplicate marking should NOT be applied to amplicon sequencing data or other data types where reads start and stop at the same positions by design."

That's essentially what happens with GBS data, since the sequenced fragments are obtained by cutting the genome at the same restriction sites across all samples. Have I been wrong to dedup my data before genotyping? Many thanks!

GenomicsDBImport run slowly with multiple samples


Since I took the GATK4 training early this year, I have switched my resequencing analysis from GATK 3.6 to GATK 4.0. Everything worked fine except the GVCF consolidation step: GenomicsDBImport takes much longer than the traditional CombineGVCFs. On a pilot run of three samples, it took 40 hours while CombineGVCFs took only one hour.
Now I am expanding to 140 samples. I gave GenomicsDBImport a test run on 1 Mb of one chromosome: it took 5 hours, while CombineGVCFs took 10 minutes. I realized after a few runs that the running time is linear in both my sample size and the interval size. Therefore, I tried a few parameters as suggested, including:
1. simply running gatk --java-options "-Xmx4g -DGATK_STACKTRACE_ON_USER_EXCEPTION=true" GenomicsDBImport --genomicsdb-workspace-path my_db --intervals chr2:1-10000000 -R ref.fa -V gvcf.list
2. using --batch-size 10, which took 5% longer than the first solution
3. using --batch-size 10 --reader-threads 5, no difference from solution 2
4. using -L intervals.list, which split the 10 Mb into 10 intervals but took the same time as solution 2
It turns out that GenomicsDBImport without any extra parameters runs the fastest, although it's still too slow. I've used CombineGVCFs to generate a consolidated GVCF file for GenotypeGVCFs, although I noticed that GenomicsDBImport and CombineGVCFs resulted in different variants. I would like to compare the variants side by side.

So my questions are,
1. What are the criteria for splitting one chromosome into a few dozen intervals without causing problems? For example, if a gene is split across two intervals, does that affect the following steps?
2. What is the procedure after GenomicsDBImport is done on consecutive intervals, given that GenotypeGVCFs only runs on an individual database? I'm testing it now, so I will have an answer this week.
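
For question 2, what I'm testing looks roughly like this (a sketch; GatherVcfs is a real GATK4/Picard tool and expects the per-interval VCFs in genomic order, and the workspace names are placeholders):

# one workspace per scatter interval, genotyped separately
gatk GenotypeGVCFs -R ref.fasta -V gendb://db_chr2_part1 -O chr2_part1.vcf.gz
gatk GenotypeGVCFs -R ref.fasta -V gendb://db_chr2_part2 -O chr2_part2.vcf.gz
# concatenate the per-interval results back into one per-chromosome VCF
gatk GatherVcfs -I chr2_part1.vcf.gz -I chr2_part2.vcf.gz -O chr2.vcf.gz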
After all, I didn't expect GenomicsDBImport to run this slowly, because I had read in the GATK4 forum that CombineGVCFs is the slower one. I would appreciate it if the slowness were fixed; otherwise, I have to split chromosomes into small intervals.

GenomicsDBImport too slow on local server


Hi,

I tried using GenomicsDBImport for our data. In my test case I tried importing chromosome 1 for 223 samples. Since most samples are panels and we have only a few genomes and exomes, I thought it would be best to always call everything together.
My commandline:

/opt/gatk/4.0.0.0/gatk --java-options "-Xmx8G -Xms8G" GenomicsDBImport \
    --sample-name-map [...]/all_samples.sample_map \
    --genomicsdb-workspace-path [...]/germline_snp_database_1 \
    --batch-size 50 \
    -L NC_000001 \
    --reader-threads 5

I only use 5 reader threads because I plan on parallelizing with scatter-gather later on. The command has been running for 14 hours on a local server. Is there something wrong, or something I can do to make it reasonably fast? So far the GATK 3.8 pipeline is way faster.

Thanks & best regards,
Daniel

Running genotypeGVCFs with ~4000 human exome data: stuck on "ProgressMeter - Starting"


Hello,

I am running GenotypeGVCFs on ~4000 human exome samples. To speed up the process, I have split exome.interval_list into sub-interval lists, where each interval file contains ~100 kb of regions. Then I submitted the GenotypeGVCFs jobs in parallel, one per sub-interval list, e.g.

java -Xmx32g -jar /GATK/3.6/jar-bin/GenomeAnalysisTK.jar -T GenotypeGVCFs -nt 2 -L /home/jjduan/scatter_interval_list/interval_list.sub000000.interval_list -D /home/jjduan/ref_b37/dbsnp_138.b37.vcf -R /home/jjduan/ref_b37/human_g1k_v37.fasta --variant /home/jjduan/mergedGVCF/chr_19_mergedGVCF.list -o /home/jjduan/genotypedVCF/chr_19_sub000000.vcf

java -Xmx32g -jar /GATK/3.6/jar-bin/GenomeAnalysisTK.jar -T GenotypeGVCFs -nt 2 -L /home/jjduan/scatter_interval_list/interval_list.sub000001.interval_list -D /home/jjduan/ref_b37/dbsnp_138.b37.vcf -R /home/jjduan/ref_b37/human_g1k_v37.fasta --variant /home/jjduan/mergedGVCF/chr_19_mergedGVCF.list -o /home/jjduan/genotypedVCF/chr_19_sub000001.vcf

...

However, I kept receiving "ProgressMeter - Starting" for hours without any variants output.

INFO  00:09:31,580 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining
INFO  00:09:31,581 ProgressMeter -        Location |     sites | elapsed |     sites | completed | runtime |   runtime
WARN  00:09:32,292 StrandBiasTest - StrandBiasBySample annotation exists in input VCF header. Attempting to use StrandBiasBySample values to calculate strand bi
WARN  00:09:32,295 StrandBiasTest - StrandBiasBySample annotation exists in input VCF header. Attempting to use StrandBiasBySample values to calculate strand bi
INFO  00:09:32,295 GenotypeGVCFs - Notice that the -ploidy parameter is ignored in GenotypeGVCFs tool as this is automatically determined by the input variant f
INFO  00:10:01,605 ProgressMeter -        Starting         0.0    30.0 s      49.6 w      100.0%    30.0 s       0.0 s
INFO  00:10:31,606 ProgressMeter -        Starting         0.0    60.0 s      99.2 w      100.0%    60.0 s       0.0 s
INFO  00:11:01,608 ProgressMeter -        Starting         0.0    90.0 s     148.9 w      100.0%    90.0 s       0.0 s
INFO  00:11:31,611 ProgressMeter -        Starting         0.0   120.0 s     198.5 w      100.0%   120.0 s       0.0 s
INFO  00:12:01,613 ProgressMeter -        Starting         0.0     2.5 m     248.1 w      100.0%     2.5 m       0.0 s

I have read this thread and noticed this happens for reference genomes with millions of contigs. But my data is human, with far fewer contigs, so I don't think it's the same case.

I know WDL/Cromwell supports the scatter/gather method to speed things up. However, as I understand it, the principle of scatter/gather is the same as what I did here. So even using WDL, the parallelized jobs would still face the same stuck situation. Is that right?

Is there anything else I can do to get this to run at all, or faster, or should I just wait?

Thanks a lot for any inputs!

Best,
Jinjie


Best germline resource to use with latest gatk4 Mutect2?


In the latest Mutect2 tutorial

https://gatkforums.broadinstitute.org/gatk/discussion/24057/how-to-call-somatic-mutations-using-gatk4-mutect2#latest

there's mention of using an af-only-gnomad.vcf file for the germline resource, but I didn't find a link to the resource to use.

If I google it, I find these:
http://bioinfo5pilm46.mit.edu/software/GATK/resources/
and
http://hgdownload.cse.ucsc.edu/gbdb/hg19/gnomAD/vcf/

If I go to the gatk wdl pipeline for the current mutect2 workflow:
https://github.com/broadinstitute/gatk/tree/master/scripts/mutect2_wdl

it points to
https://gnomad.broadinstitute.org/downloads

Can you add a link to the exact resource file that should be used from the tutorial page?
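
For context, this is where the resource gets plugged in (a sketch; the exact gnomAD file name differs by reference build, and the other file names are placeholders):

gatk Mutect2 \
    -R ref.fasta \
    -I tumor.bam \
    --germline-resource af-only-gnomad.hg38.vcf.gz \
    -O somatic.vcf.gz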

Latest Mutect2 (4.1.1.0) tutorial missing CalculateContamination ?

A way to come up with "truth set" to use VQSR


Dear GATK Team,

I have a question regarding finding cutoffs for hard filtering. I am working with yeast, for which we do not have a good set of true variants. I am following the Best Practices and have done the joint genotyping of my samples. To give some idea, my samples are yeast clones isolated from a population at different time points. I was wondering if I can select a subset of variants that are shared among more than 2 samples (and thus more likely to be correct) to use as my "truth set", and thereby use the VQSR pipeline instead. Am I doing something obviously wrong with this approach?
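
Concretely, I was thinking of something like this (a sketch in GATK4 syntax; the resource name, prior and annotation list are placeholders I would have to tune):

gatk VariantRecalibrator \
    -R yeast_ref.fasta \
    -V joint_genotyped.vcf.gz \
    --resource:sharedSet,known=false,training=true,truth=true,prior=10.0 shared_variants.vcf.gz \
    -an QD -an FS -an MQ -an MQRankSum -an ReadPosRankSum -an SOR \
    -mode SNP \
    -O output.recal \
    --tranches-file output.tranches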

Thank you,
Ramya

Indel shown in IGV not called by GATK 4.0


The code I used to produce g.vcf file is

gatk HaplotypeCaller -R /work7_P1/GATK_RegionCall/IRGSP-1.0/IRGSP-1.0_genome.fasta -I /work7_P1/GATK_RegionCall/Bam_SortIndex/CX227.bam -O /work7_P1/GATK_RegionCall/Variant_call/CX227.g.vcf -L chr01 -ERC GVCF --max-alternate-alleles 25

In the g.vcf file, site 4886943 is marked as reference allele

chr01   4886940 .       G       <NON_REF>       .       .       END=4886942     GT:DP:GQ:MIN_DP:PL      0/0:19:39:19:0,39,585
chr01   4886943 .       T       <NON_REF>       .       .       END=4886943     GT:DP:GQ:MIN_DP:PL      0/0:18:18:18:0,18,270
chr01   4886944 .       A       <NON_REF>       .       .       END=4886951     GT:DP:GQ:MIN_DP:PL      0/0:15:0:13:0,0,93
chr01   4886952 .       A       <NON_REF>       .       .       END=4886952     GT:DP:GQ:MIN_DP:PL      0/0:15:10:15:0,10,515

However, when I checked my bam file, I can see an 8 bp deletion, which is correct (or at least has the most supporting reads).

Then I checked my final vcf file, and the site is also marked as the reference allele, supported by 18 reads:

chr01   4886943 .       TAGAGAGAGAG     TAG,TAGAGAGAGAGAGAGAG,TAGAGAG,TAGAG,TAGAGAGAGAGAGAG,T   279066.38       .       AC=963,65,1008,117,62,44;AF=0.167,0.011,0.175,0.020,0.011,7.644e-03;AN=5756;BaseQRankSum=-3.190e-01;DP=23753;ExcessHet=-0.0000;FS=8.590;InbreedingCoeff=0.8526;MQ=58.21;MQRankSum=0.00;QD=27.26;ReadPosRankSum=0.00;SOR=2.043   GT:AD:DP:GQ:PGT:PID:PL:PS       0/0:18,0,0,0,0,0,0:18:18:.:.:0,18,270,18,270,270,18,270,270,270,18,270,270,270,270,18,270,270,270,270,270,18,270,270,270,270,270,270

Why can't GATK recognize this kind of indel or provide any evidence about it?
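
Is there a recommended way to debug this? For example, I could ask HaplotypeCaller for its reassembled reads and inspect them in IGV next to the original bam (a sketch reusing my command above; -bamout writes the reassembled reads, and the interval is narrowed to the region of interest):

gatk HaplotypeCaller \
    -R /work7_P1/GATK_RegionCall/IRGSP-1.0/IRGSP-1.0_genome.fasta \
    -I /work7_P1/GATK_RegionCall/Bam_SortIndex/CX227.bam \
    -L chr01:4886000-4888000 \
    -ERC GVCF --max-alternate-alleles 25 \
    -bamout CX227.debug.bam \
    -O CX227.region.g.vcf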

Construct genome with gvcf or vcf file?

I have 100 rice accessions and their mapping files from DNA sequencing. My goal is to construct the genome of a specific region for each rice accession. Could I just use the gvcf file after the initial variant calling? If so, how can I guarantee the quality of those SNP/indel sites? Or must I use the vcf file after the final genotype call?
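
For the region extraction itself, I assume something like FastaAlternateReferenceMaker is the right tool once the variants are filtered (a sketch; region and file names are placeholders):

gatk FastaAlternateReferenceMaker \
    -R rice_ref.fasta \
    -V accession1.filtered.vcf.gz \
    -L chr1:100000-200000 \
    -O accession1.region.fasta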

SAC annotation


Hi,

I would like to know if there is a way to include the SAC (StrandAlleleCountsBySample) annotation in my vcf files.

I know this annotation is only available in HaplotypeCaller. But I want to know if there is a way to run HC again only to add the SAC annotation.

I'm asking because HC took too much time and memory to run on my 91 samples.
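
What I have in mind would look something like this (a sketch; it assumes -L can take the sites from the first run as a VCF, so HC only revisits those loci, and the file names are placeholders):

gatk HaplotypeCaller \
    -R ref.fasta \
    -I sample.bam \
    -L first_pass_sites.vcf.gz \
    -A StrandAlleleCountsBySample \
    -O sample.sac.vcf.gz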

Sorry if this question is repetitive or too basic, but I couldn't find the answer in another forum question or in the tutorials.

Thank you so much

Alternate way to install GATK4, without using conda or the docker?

I've run GATK4's CNV detection tools in the docker image on my local system before. I'm now trying to run them on my server. I followed the instructions on the '(How to) Install and use Conda for GATK4' page, but unfortunately my server supports neither docker images nor conda installations.

Is there another way I can get it running with all dependencies?
- I already have GATK 4.1.2.0 installed on the server, but I can't run DetermineGermlineContigPloidy, GermlineCNVCaller and PostprocessGermlineCNVCalls without gcnvkernel installed.

- In particular, I'm trying to get GATK with gcnvkernel installed.
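
One route I am considering, assuming pip and virtualenv are allowed on the server: the GATK release zip ships the gcnvkernel sources as gatkPythonPackageArchive.zip, which pip can install directly (the working dependency versions are pinned in gatkcondaenv.yml in the same zip, so mismatched theano/pymc3 versions may still break it):

python3 -m venv gatk-env                # a plain virtualenv instead of conda
source gatk-env/bin/activate
unzip gatk-4.1.2.0.zip                  # the release zip I already have
pip install gatk-4.1.2.0/gatkPythonPackageArchive.zip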

Cannot download bundle files from the FTP server


Hi,

I'm trying to download the bundle files (specifically, the "hg19" folder in the bundle) on a Mac, but without any success. I constantly get an error message like this:

The Finder can’t complete the operation because some data in “dbsnp_138.hg19.excluding_sites_after_129.vcf.gz” can’t be read or written.
(Error code -36)

Could anyone please help me with this?
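
In case it matters, the command-line fetch I could try instead of Finder would be (a sketch, assuming the classic anonymous bundle login still works):

wget 'ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/dbsnp_138.hg19.excluding_sites_after_129.vcf.gz'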

Thank you very much!

GATK 4 Determine Contig Ploidy Error - Sample "unknown" already has coverage metadata annotations


Hi,

I am running GATK 4.1 for germline CNV calling on 200+ WES samples and came across the error below while running the DetermineGermlineContigPloidy step. Any help would be appreciated.

15:52:07.627 DEBUG ScriptExecutor - Executing:
15:52:07.627 DEBUG ScriptExecutor - python
15:52:07.627 DEBUG ScriptExecutor - /tmp/cohort_determine_ploidy_and_depth.5271158842796165046.py
15:52:07.628 DEBUG ScriptExecutor - --sample_coverage_metadata=/tmp/samples-by-coverage-per-contig1630545224704572068.tsv
15:52:07.628 DEBUG ScriptExecutor - --output_calls_path=/mnt/data/smb_share/Mandal_project/PCa_WES/mcdowell#Bailey-Wilson_NF_AAPC/SampleBams/ploidy-calls
15:52:07.628 DEBUG ScriptExecutor - --mapping_error_rate=1.000000e-02
15:52:07.628 DEBUG ScriptExecutor - --psi_s_scale=1.000000e-04
15:52:07.628 DEBUG ScriptExecutor - --mean_bias_sd=1.000000e-02
15:52:07.628 DEBUG ScriptExecutor - --psi_j_scale=1.000000e-03
15:52:07.628 DEBUG ScriptExecutor - --learning_rate=5.000000e-02
15:52:07.628 DEBUG ScriptExecutor - --adamax_beta1=9.000000e-01
15:52:07.628 DEBUG ScriptExecutor - --adamax_beta2=9.990000e-01
15:52:07.628 DEBUG ScriptExecutor - --log_emission_samples_per_round=2000
15:52:07.628 DEBUG ScriptExecutor - --log_emission_sampling_rounds=100
15:52:07.628 DEBUG ScriptExecutor - --log_emission_sampling_median_rel_error=5.000000e-04
15:52:07.628 DEBUG ScriptExecutor - --max_advi_iter_first_epoch=1000
15:52:07.628 DEBUG ScriptExecutor - --max_advi_iter_subsequent_epochs=1000
15:52:07.628 DEBUG ScriptExecutor - --min_training_epochs=20
15:52:07.628 DEBUG ScriptExecutor - --max_training_epochs=100
15:52:07.628 DEBUG ScriptExecutor - --initial_temperature=2.000000e+00
15:52:07.628 DEBUG ScriptExecutor - --num_thermal_advi_iters=5000
15:52:07.628 DEBUG ScriptExecutor - --convergence_snr_averaging_window=5000
15:52:07.628 DEBUG ScriptExecutor - --convergence_snr_trigger_threshold=1.000000e-01
15:52:07.628 DEBUG ScriptExecutor - --convergence_snr_countdown_window=10
15:52:07.628 DEBUG ScriptExecutor - --max_calling_iters=1
15:52:07.628 DEBUG ScriptExecutor - --caller_update_convergence_threshold=1.000000e-03
15:52:07.628 DEBUG ScriptExecutor - --caller_internal_admixing_rate=7.500000e-01
15:52:07.628 DEBUG ScriptExecutor - --caller_external_admixing_rate=7.500000e-01
15:52:07.628 DEBUG ScriptExecutor - --disable_caller=false
15:52:07.628 DEBUG ScriptExecutor - --disable_sampler=false
15:52:07.628 DEBUG ScriptExecutor - --disable_annealing=false
15:52:07.628 DEBUG ScriptExecutor - --interval_list=/tmp/intervals5347690040069421909.tsv
15:52:07.628 DEBUG ScriptExecutor - --contig_ploidy_prior_table=/mnt/data/smb_share/Mandal_project/PCa_WES/mcdowell#Bailey-Wilson_NF_AAPC/SampleBams/contigPloidyPriorsTable2.tsv
15:52:07.628 DEBUG ScriptExecutor - --output_model_path=/mnt/data/smb_share/Mandal_project/PCa_WES/mcdowell#Bailey-Wilson_NF_AAPC/SampleBams/ploidy-model
Traceback (most recent call last):
File "/tmp/cohort_determine_ploidy_and_depth.5271158842796165046.py", line 87, in
sample_metadata_collection, args.sample_coverage_metadata)
File "/usr/miniconda3/envs/gatk/lib/python3.6/site-packages/gcnvkernel/io/io_metadata.py", line 78, in read_sample_coverage_metadata
sample_name, n_j, contig_list))
File "/usr/miniconda3/envs/gatk/lib/python3.6/site-packages/gcnvkernel/structs/metadata.py", line 227, in add_sample_coverage_metadata
'Sample "{0}" already has coverage metadata annotations'.format(sample_name))
gcnvkernel.structs.metadata.SampleAlreadyInCollectionException: Sample "unknown" already has coverage metadata annotations
15:52:16.249 DEBUG ScriptExecutor - Result: 1
15:52:16.250 INFO DetermineGermlineContigPloidy - Shutting down engine
[June 21, 2019 3:52:16 PM CDT] org.broadinstitute.hellbender.tools.copynumber.DetermineGermlineContigPloidy done. Elapsed time: 2.55 minutes.
Runtime.totalMemory()=5929172992
org.broadinstitute.hellbender.utils.python.PythonScriptExecutorException:
python exited with 1
Command Line: python /tmp/cohort_determine_ploidy_and_depth.5271158842796165046.py --sample_coverage_metadata=/tmp/samples-by-coverage-per-contig1630545224704572068.tsv --output_calls_path=/mnt/data/smb_share/Mandal_project/PCa_WES/mcdowell#Bailey-Wilson_NF_AAPC/SampleBams/ploidy-calls --mapping_error_rate=1.000000e-02 --psi_s_scale=1.000000e-04 --mean_bias_sd=1.000000e-02 --psi_j_scale=1.000000e-03 --learning_rate=5.000000e-02 --adamax_beta1=9.000000e-01 --adamax_beta2=9.990000e-01 --log_emission_samples_per_round=2000 --log_emission_sampling_rounds=100 --log_emission_sampling_median_rel_error=5.000000e-04 --max_advi_iter_first_epoch=1000 --max_advi_iter_subsequent_epochs=1000 --min_training_epochs=20 --max_training_epochs=100 --initial_temperature=2.000000e+00 --num_thermal_advi_iters=5000 --convergence_snr_averaging_window=5000 --convergence_snr_trigger_threshold=1.000000e-01 --convergence_snr_countdown_window=10 --max_calling_iters=1 --caller_update_convergence_threshold=1.000000e-03 --caller_internal_admixing_rate=7.500000e-01 --caller_external_admixing_rate=7.500000e-01 --disable_caller=false --disable_sampler=false --disable_annealing=false --interval_list=/tmp/intervals5347690040069421909.tsv --contig_ploidy_prior_table=/mnt/data/smb_share/Mandal_project/PCa_WES/mcdowell#Bailey-Wilson_NF_AAPC/SampleBams/contigPloidyPriorsTable2.tsv --output_model_path=/mnt/data/smb_share/Mandal_project/PCa_WES/mcdowell#Bailey-Wilson_NF_AAPC/SampleBams/ploidy-model
at org.broadinstitute.hellbender.utils.python.PythonExecutorBase.getScriptException(PythonExecutorBase.java:75)
at org.broadinstitute.hellbender.utils.runtime.ScriptExecutor.executeCuratedArgs(ScriptExecutor.java:126)
at org.broadinstitute.hellbender.utils.python.PythonScriptExecutor.executeArgs(PythonScriptExecutor.java:170)
at org.broadinstitute.hellbender.utils.python.PythonScriptExecutor.executeScript(PythonScriptExecutor.java:151)
at org.broadinstitute.hellbender.utils.python.PythonScriptExecutor.executeScript(PythonScriptExecutor.java:121)
at org.broadinstitute.hellbender.tools.copynumber.DetermineGermlineContigPloidy.executeDeterminePloidyAndDepthPythonScript(DetermineGermlineContigPloidy.java:411)
at org.broadinstitute.hellbender.tools.copynumber.DetermineGermlineContigPloidy.doWork(DetermineGermlineContigPloidy.java:288)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:138)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:162)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:205)
at org.broadinstitute.hellbender.Main.main(Main.java:291)

Thanks,

Tarun

Pedigree / PED files


A pedigree is a structured description of the familial relationships between samples.

Some GATK tools are capable of incorporating pedigree information in the analysis they perform if provided in the form of a PED file through the --pedigree (or -ped) argument.


PED file format

PED files are tabular text files describing meta-data about the samples. See http://www.broadinstitute.org/mpg/tagger/faq.html and http://zzz.bwh.harvard.edu/plink/data.shtml#ped for more information.

The PED file is a white-space (space or tab) delimited file: the first six columns are mandatory:

  • Family ID
  • Individual ID
  • Paternal ID
  • Maternal ID
  • Sex (1=male; 2=female; other=unknown)
  • Phenotype

The IDs are alphanumeric: the combination of family and individual ID should uniquely identify a person. If an individual's sex is unknown, then any character other than 1 or 2 can be used in the fifth column.

A PED file must have 1 and only 1 phenotype in the sixth column. The phenotype can be either a quantitative trait or an "affected status" column: GATK will automatically detect which type (i.e. based on whether a value other than 0, 1, 2 or the missing genotype code is observed).

Affected status should be coded as follows:

  • -9 missing
  • 0 missing
  • 1 unaffected
  • 2 affected

If any value outside of -9,0,1,2 is detected, then the samples are assumed to have phenotype values, interpreted as string phenotype values.

Note that genotypes (column 7 onwards) cannot be provided to GATK.

You can add a comment to a PED or MAP file by starting the line with a # character. The rest of that line will be ignored, so make sure none of the IDs start with this character.

Each -ped argument can be tagged with NO_FAMILY_ID, NO_PARENTS, NO_SEX, NO_PHENOTYPE to tell the GATK PED parser that the corresponding fields are missing from the ped file.

Example

Here are two individuals (one row = one person):

FAM001  1  0 0  1  2
FAM001  2  0 0  1  2
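
For instance, a pedigree can be passed to a tool that supports it; a usage sketch with GATK4 CalculateGenotypePosteriors (file names are placeholders):

gatk CalculateGenotypePosteriors \
    -V trio.vcf.gz \
    -ped family.ped \
    -O trio.posteriors.vcf.gz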

Picard LiftoverVcf Duplicate allele added to VariantContext


Hi,
I've run into a frustrating problem: when lifting over certain VCFs from b37 to hg38, LiftoverVcf exits with errors like Exception in thread "main" java.lang.IllegalArgumentException: Duplicate allele added to VariantContext

I've isolated it to an example failing variant, but I'm at a loss for how to fix or prevent this error, since such variants seem scattered among VCFs I've generated with GATK 3.8-1-0-gf15c1c3ef GenotypeGVCFs.

Picard version 2.20.2

$ java -jar picard.jar LiftoverVcf --version
2.20.2-SNAPSHOT

Failing variant example:

$ java -jar picard.jar LiftoverVcf C=/gpfs/gpfs2/cooperlab/resources/liftover_chain_files/b37ToHg38.over.chain.gz I=bad_simple.vcf O=test.vcf R=hg38.fa REJECT=test_b37.reject.vcf
INFO    2019-06-22 00:35:10 LiftoverVcf 

********** NOTE: Picard's command line syntax is changing.
**********
********** For more information, please see:
********** https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition)
**********
********** The command line looks like this in the new syntax:
**********
**********    LiftoverVcf -C /gpfs/gpfs2/cooperlab/resources/liftover_chain_files/b37ToHg38.over.chain.gz -I bad_simple.vcf -O test.vcf -R hg38.fa -REJECT test_b37.reject.vcf
**********


00:35:10.862 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gpfs/gpfs1/home/jlawlor/test_liftover/round_4/picard.jar!/com/intel/gkl/native/libgkl_compression.so
[Sat Jun 22 00:35:10 CDT 2019] LiftoverVcf INPUT=bad_simple.vcf OUTPUT=test.vcf CHAIN=/gpfs/gpfs2/cooperlab/resources/liftover_chain_files/b37ToHg38.over.chain.gz REJECT=test_b37.reject.vcf REFERENCE_SEQUENCE=hg38.fa    WARN_ON_MISSING_CONTIG=false LOG_FAILED_INTERVALS=true WRITE_ORIGINAL_POSITION=false WRITE_ORIGINAL_ALLELES=false LIFTOVER_MIN_MATCH=1.0 ALLOW_MISSING_FIELDS_IN_HEADER=false RECOVER_SWAPPED_REF_ALT=false TAGS_TO_REVERSE=[AF] TAGS_TO_DROP=[MAX_AF] DISABLE_SORT=false VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Sat Jun 22 00:35:10 CDT 2019] Executing as jlawlor@hpc0005 on Linux 3.10.0-327.3.1.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_102-b14; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.20.2-SNAPSHOT
INFO    2019-06-22 00:35:11 LiftoverVcf Loading up the target reference genome.
INFO    2019-06-22 00:35:21 LiftoverVcf Lifting variants over and sorting (not yet writing the output file.)
[Sat Jun 22 00:35:21 CDT 2019] picard.vcf.LiftoverVcf done. Elapsed time: 0.18 minutes.
Runtime.totalMemory()=6996623360
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" java.lang.IllegalArgumentException: Duplicate allele added to VariantContext: T
    at htsjdk.variant.variantcontext.VariantContext.makeAlleles(VariantContext.java:1493)
    at htsjdk.variant.variantcontext.VariantContext.<init>(VariantContext.java:379)
    at htsjdk.variant.variantcontext.VariantContextBuilder.make(VariantContextBuilder.java:579)
    at htsjdk.variant.variantcontext.VariantContextBuilder.make(VariantContextBuilder.java:573)
    at picard.util.LiftoverUtils.liftVariant(LiftoverUtils.java:117)
    at picard.vcf.LiftoverVcf.doWork(LiftoverVcf.java:426)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:295)
    at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
    at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:113)

with VCF bad_simple.vcf

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##ALT=<ID=NON_REF,Description="Represents any possible alternative allele at this location">
##DRAGENCommandLine=<ID=dragen,Version="SW: 01.011.269.3.2.8, HW: 01.011.269",Date="Tue May 21 12:16:34 CDT 2019",CommandLineOptions="-f -r /staging/reference/GRCh37/GRCh37.fa.k_21.f_16.m_149 --fastq-list /staging/fastq/SL385519_fastqs/SL385519_list.csv --output-directory /staging/bam/ --output-file-prefix SL385519 --enable-duplicate-marking true --enable-map-align-output true --enable-variant-caller true --vc-sample-name SL385519 --vc-emit-ref-confidence GVCF --dbsnp /staging/reference/GRCh37/dbsnp_135.b37.vcf">
##FILTER=<ID=DRAGENHardQUAL,Description="Set if true:QUAL < 10.4139">
##FILTER=<ID=LowDepth,Description="Set if true:DP < 1">
##FILTER=<ID=LowGQ,Description="Set if true:GQ = 0">
##FILTER=<ID=PloidyConflict,Description="Genotype call from variant caller not consistent with chromosome ploidy">
##FILTER=<ID=VQSRTrancheINDEL99.00to99.90,Description="Truth sensitivity tranche level for INDEL model at VQS Lod: -12.1756 <= x < -1.3496">
##FILTER=<ID=VQSRTrancheINDEL99.90to100.00+,Description="Truth sensitivity tranche level for INDEL model at VQS Lod < -1409.7427">
##FILTER=<ID=VQSRTrancheINDEL99.90to100.00,Description="Truth sensitivity tranche level for INDEL model at VQS Lod: -1409.7427 <= x < -12.1756">
##FILTER=<ID=VQSRTrancheSNP99.00to99.90,Description="Truth sensitivity tranche level for SNP model at VQS Lod: -4.6589 <= x < 0.343">
##FILTER=<ID=VQSRTrancheSNP99.90to100.00+,Description="Truth sensitivity tranche level for SNP model at VQS Lod < -39592.3492">
##FILTER=<ID=VQSRTrancheSNP99.90to100.00,Description="Truth sensitivity tranche level for SNP model at VQS Lod: -39592.3492 <= x < -4.6589">
##FILTER=<ID=lod_fstar,Description="Variant does not meet likelihood threshold (default threshold is 6.3)">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=AF,Number=A,Type=Float,Description="Allelic frequency for alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=F1R2,Number=R,Type=Integer,Description="Count of reads in F1R2 pair orientation supporting each allele">
##FORMAT=<ID=F2R1,Number=R,Type=Integer,Description="Count of reads in F2R1 pair orientation supporting each allele">
##FORMAT=<ID=FT,Number=.,Type=String,Description="Genotype-level filter">
##FORMAT=<ID=GL,Number=G,Type=Float,Description="Normalized likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=GP,Number=G,Type=Float,Description="Phred-scaled posterior probabilities for genotypes as defined in the VCF specification">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=ICNT,Number=2,Type=Integer,Description="Counts of INDEL informative reads based on the reference confidence model">
##FORMAT=<ID=LOD,Number=1,Type=Float,Description="Per-sample variant LOD score">
##FORMAT=<ID=MB,Number=4,Type=Integer,Description="Per-sample component statistics to detect mate bias">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=PP,Number=G,Type=Integer,Description="Phred-scaled posterior genotype probabilities">
##FORMAT=<ID=PRI,Number=G,Type=Float,Description="Phred-scaled prior probabilities for genotypes">
##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group">
##FORMAT=<ID=RGQ,Number=1,Type=Integer,Description="Unconditional reference genotype confidence, encoded as a phred quality -10*log10 p(genotype call is wrong)">
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias">
##FORMAT=<ID=SPL,Number=.,Type=Integer,Description="Normalized, Phred-scaled likelihoods for SNPs based on the reference confidence model">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP Membership">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
##INFO=<ID=ExcessHet,Number=1,Type=Float,Description="Phred-scaled p-value for exact test of excess heterozygosity">
##INFO=<ID=FGT,Number=0,Type=Flag,Description="ForceGT variant call">
##INFO=<ID=FS,Number=1,Type=Float,Description="Phred-scaled p-value using Fisher's exact test to detect strand bias">
##INFO=<ID=FractionInformativeReads,Number=1,Type=Float,Description="The fraction of informative reads out of the total reads">
##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes">
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
##INFO=<ID=LOD,Number=1,Type=Float,Description="Variant LOD score">
##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
##INFO=<ID=NEGATIVE_TRAIN_SITE,Number=0,Type=Flag,Description="This variant was used to build the negative training set of bad variants">
##INFO=<ID=NML,Number=0,Type=Flag,Description="Normal (non-ForceGT) variant call">
##INFO=<ID=POSITIVE_TRAIN_SITE,Number=0,Type=Flag,Description="This variant was used to build the positive training set of good variants">
##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
##INFO=<ID=R2_5P_bias,Number=1,Type=Float,Description="Score based on mate bias and distance from 5 prime end">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
##INFO=<ID=SOR,Number=1,Type=Float,Description="Symmetric Odds Ratio of 2x2 contingency table to detect strand bias">
##INFO=<ID=VQSLOD,Number=1,Type=Float,Description="Log odds of being a true variant versus being false under the trained gaussian mixture model">
##INFO=<ID=culprit,Number=1,Type=String,Description="The annotation which was the worst performing in the Gaussian mixture model, likely the reason why the variant was filtered out">
##contig=<ID=1,length=249250621>
##contig=<ID=2,length=243199373>
##contig=<ID=3,length=198022430>
##contig=<ID=4,length=191154276>
##contig=<ID=5,length=180915260>
##contig=<ID=6,length=171115067>
##contig=<ID=7,length=159138663>
##contig=<ID=8,length=146364022>
##contig=<ID=9,length=141213431>
##contig=<ID=10,length=135534747>
##contig=<ID=11,length=135006516>
##contig=<ID=12,length=133851895>
##contig=<ID=13,length=115169878>
##contig=<ID=14,length=107349540>
##contig=<ID=15,length=102531392>
##contig=<ID=16,length=90354753>
##contig=<ID=17,length=81195210>
##contig=<ID=18,length=78077248>
##contig=<ID=19,length=59128983>
##contig=<ID=20,length=63025520>
##contig=<ID=21,length=48129895>
##contig=<ID=22,length=51304566>
##contig=<ID=X,length=155270560>
##contig=<ID=Y,length=59373566>
##contig=<ID=MT,length=16569>
##contig=<ID=GL000207.1,length=4262>
##contig=<ID=GL000226.1,length=15008>
##contig=<ID=GL000229.1,length=19913>
##contig=<ID=GL000231.1,length=27386>
##contig=<ID=GL000210.1,length=27682>
##contig=<ID=GL000239.1,length=33824>
##contig=<ID=GL000235.1,length=34474>
##contig=<ID=GL000201.1,length=36148>
##contig=<ID=GL000247.1,length=36422>
##contig=<ID=GL000245.1,length=36651>
##contig=<ID=GL000197.1,length=37175>
##contig=<ID=GL000203.1,length=37498>
##contig=<ID=GL000246.1,length=38154>
##contig=<ID=GL000249.1,length=38502>
##contig=<ID=GL000196.1,length=38914>
##contig=<ID=GL000248.1,length=39786>
##contig=<ID=GL000244.1,length=39929>
##contig=<ID=GL000238.1,length=39939>
##contig=<ID=GL000202.1,length=40103>
##contig=<ID=GL000234.1,length=40531>
##contig=<ID=GL000232.1,length=40652>
##contig=<ID=GL000206.1,length=41001>
##contig=<ID=GL000240.1,length=41933>
##contig=<ID=GL000236.1,length=41934>
##contig=<ID=GL000241.1,length=42152>
##contig=<ID=GL000243.1,length=43341>
##contig=<ID=GL000242.1,length=43523>
##contig=<ID=GL000230.1,length=43691>
##contig=<ID=GL000237.1,length=45867>
##contig=<ID=GL000233.1,length=45941>
##contig=<ID=GL000204.1,length=81310>
##contig=<ID=GL000198.1,length=90085>
##contig=<ID=GL000208.1,length=92689>
##contig=<ID=GL000191.1,length=106433>
##contig=<ID=GL000227.1,length=128374>
##contig=<ID=GL000228.1,length=129120>
##contig=<ID=GL000214.1,length=137718>
##contig=<ID=GL000221.1,length=155397>
##contig=<ID=GL000209.1,length=159169>
##contig=<ID=GL000218.1,length=161147>
##contig=<ID=GL000220.1,length=161802>
##contig=<ID=GL000213.1,length=164239>
##contig=<ID=GL000211.1,length=166566>
##contig=<ID=GL000199.1,length=169874>
##contig=<ID=GL000217.1,length=172149>
##contig=<ID=GL000216.1,length=172294>
##contig=<ID=GL000215.1,length=172545>
##contig=<ID=GL000205.1,length=174588>
##contig=<ID=GL000219.1,length=179198>
##contig=<ID=GL000224.1,length=179693>
##contig=<ID=GL000223.1,length=180455>
##contig=<ID=GL000195.1,length=182896>
##contig=<ID=GL000212.1,length=186858>
##contig=<ID=GL000222.1,length=186861>
##contig=<ID=GL000200.1,length=187035>
##contig=<ID=GL000193.1,length=189789>
##contig=<ID=GL000194.1,length=191469>
##contig=<ID=GL000225.1,length=211173>
##contig=<ID=GL000192.1,length=547496>
##reference=file:///gpfs/gpfs1/myerslab/reference/genomes/bwa-0.7.8/GRCh37.fa
##bcftools_viewVersion=1.7+htslib-1.4.1
##bcftools_viewCommand=view -h batch_65.vcf.gz; Date=Fri Jun 21 23:56:28 2019
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
1   143283417   .   ACCG    A,* 512.03  VQSRTrancheINDEL99.00to99.90    .

Successful variant example
It doesn't seem to be solely a problem with indels or * alternates, because this variant (from nearby) has no problems:

java -jar picard.jar LiftoverVcf C=/gpfs/gpfs2/cooperlab/resources/liftover_chain_files/b37ToHg38.over.chain.gz I=good_simple.vcf O=test.vcf R=hg38.fa REJECT=test_b37.reject.vcf
INFO    2019-06-22 00:34:30 LiftoverVcf 

********** NOTE: Picard's command line syntax is changing.
**********
********** For more information, please see:
********** https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition)
**********
********** The command line looks like this in the new syntax:
**********
**********    LiftoverVcf -C /gpfs/gpfs2/cooperlab/resources/liftover_chain_files/b37ToHg38.over.chain.gz -I good_simple.vcf -O test.vcf -R hg38.fa -REJECT test_b37.reject.vcf
**********


00:34:30.734 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gpfs/gpfs1/home/jlawlor/test_liftover/round_4/picard.jar!/com/intel/gkl/native/libgkl_compression.so
[Sat Jun 22 00:34:30 CDT 2019] LiftoverVcf INPUT=good_simple.vcf OUTPUT=test.vcf CHAIN=/gpfs/gpfs2/cooperlab/resources/liftover_chain_files/b37ToHg38.over.chain.gz REJECT=test_b37.reject.vcf REFERENCE_SEQUENCE=hg38.fa    WARN_ON_MISSING_CONTIG=false LOG_FAILED_INTERVALS=true WRITE_ORIGINAL_POSITION=false WRITE_ORIGINAL_ALLELES=false LIFTOVER_MIN_MATCH=1.0 ALLOW_MISSING_FIELDS_IN_HEADER=false RECOVER_SWAPPED_REF_ALT=false TAGS_TO_REVERSE=[AF] TAGS_TO_DROP=[MAX_AF] DISABLE_SORT=false VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Sat Jun 22 00:34:30 CDT 2019] Executing as jlawlor@hpc0005 on Linux 3.10.0-327.3.1.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_102-b14; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.20.2-SNAPSHOT
INFO    2019-06-22 00:34:30 LiftoverVcf Loading up the target reference genome.
INFO    2019-06-22 00:34:41 LiftoverVcf Lifting variants over and sorting (not yet writing the output file.)
INFO    2019-06-22 00:34:41 LiftoverVcf Processed 1 variants.
INFO    2019-06-22 00:34:41 LiftoverVcf 0 variants failed to liftover.
INFO    2019-06-22 00:34:41 LiftoverVcf 0 variants lifted over but had mismatching reference alleles after lift over.
INFO    2019-06-22 00:34:41 LiftoverVcf 0.0000% of variants were not successfully lifted over and written to the output.
INFO    2019-06-22 00:34:41 LiftoverVcf liftover success by source contig:
INFO    2019-06-22 00:34:41 LiftoverVcf 1: 1 / 1 (100.0000%)
INFO    2019-06-22 00:34:41 LiftoverVcf lifted variants by target contig:
INFO    2019-06-22 00:34:41 LiftoverVcf chr21: 1
WARNING 2019-06-22 00:34:41 LiftoverVcf 0 variants with a swapped REF/ALT were identified, but were not recovered.  See RECOVER_SWAPPED_REF_ALT and associated caveats.
INFO    2019-06-22 00:34:41 LiftoverVcf Writing out sorted records to final VCF.
[Sat Jun 22 00:34:41 CDT 2019] picard.vcf.LiftoverVcf done. Elapsed time: 0.18 minutes.

with VCF good_simple.vcf (same header as previous example)

##bcftools_viewCommand=view -h 65_1.vcf; Date=Sat Jun 22 00:14:22 2019
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
1   143283452   .   A   ACACG,* 275.1   VQSRTrancheINDEL99.00to99.90    .

Resources I'm using:
1. chain file from https://raw.githubusercontent.com/broadinstitute/gatk/master/scripts/funcotator/data_sources/gnomAD/b37ToHg38.over.chain
2. reference from ftp://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
3. Sequence dictionaries from picard CreateSequenceDictionary

File provenance: GRCh37 GVCFs generated by DRAGEN variant caller in --vc-emit-ref-confidence GVCF mode, joint-genotyped with other samples with GATK 3.8-1-0-gf15c1c3ef

I've also tried:
1. Lifting over from b37 -> hg19 (successful) and then hg19 -> hg38 (same failure) using the chain files and hg19 reference from UCSC, and all of the above using the reference from the GATK Resource Bundle.
2. Adjusting LIFTOVER_MIN_MATCH, which results in no variants successfully mapping (preventing the java error).
3. Adjusting RECOVER_SWAPPED_REF_ALT, which has no effect on this error.
4. CrossMap (v. 3.4 runs into python errors; v. 3.3 has mapping problems with b37 when "chr" isn't used in chromosome names).
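
One thing I have not yet tried is splitting the multiallelic record into biallelic ones before liftover, so the * allele sits on its own line (a sketch, using the bcftools 1.7 that already appears in the header above):

bcftools norm -m-any -f GRCh37.fa bad_simple.vcf -o bad_split.vcf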

Any advice would be appreciated!
Thanks. :)

Picard FindMendelianViolations: "Malformed header" error when specifying output directory


Hi all

I am relatively new to NGS analysis and especially to GATK. I am currently analyzing a small set of exome-seq data from a small family (3 generations, 2 individuals per generation) and wanted to check for Mendelian errors using Picard FindMendelianViolations (plus filtering the variants for a minimum coverage of 30x to avoid false calls at sparsely covered intronic SNPs). The data was generated at the BGI on a HiSeq X Ten and processed using GATK (as far as I can tell from the VCF header).

The FindMendelianViolations program works fine when using the command

java -jar /opt/picard/picard.jar FindMendelianViolations I=../../variant_files/vcf/combine.snp.vcf.gz PED=../../../0_pedigree/trio.ped OUTPUT=mendelian_trio.DP30b.txt MIN_DP=30

However, when I add an output folder, the tool first runs through the vcf but then stops with the error
"Your input file has a malformed header: BUG: VCF header has duplicate sample names". The error appears only when I specify an output folder (which seems quite weird to me), but I could reproduce it several times. I could not figure out what exactly happens. The output folder remains empty, although it seems that the tool attempts to write a file named 1.vcf.

$ java -jar /opt/picard/picard.jar FindMendelianViolations I=../../variant_files/vcf_reheader/combine.snp.reheader-out.vcf.gz PED=../../../0_pedigree/trio_nospaces.ped OUTPUT=mendelian_trio.DP30-2.txt MIN_DP=30 VCF_DIR=vcf_violations30/
INFO    2019-06-21 20:30:14 FindMendelianViolations 

********** NOTE: Picard's command line syntax is changing.
**********
********** For more information, please see:
********** https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition)
**********
********** The command line looks like this in the new syntax:
**********
**********    FindMendelianViolations -I ../../variant_files/vcf_reheader/combine.snp.reheader-out.vcf.gz -PED ../../../0_pedigree/trio_nospaces.ped -OUTPUT mendelian_trio.DP30-2.txt -MIN_DP 30 -VCF_DIR vcf_violations30/
**********


20:30:15.252 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/opt/picard/picard.jar!/com/intel/gkl/native/libgkl_compression.so
[Fri Jun 21 20:30:15 CEST 2019] FindMendelianViolations INPUT=../../variant_files/vcf_reheader/combine.snp.reheader-out.vcf.gz TRIOS=../../../0_pedigree/trio_nospaces.ped OUTPUT=mendelian_trio.DP30-2.txt MIN_DP=30 VCF_DIR=vcf_violations30    MIN_GQ=30 MIN_HET_FRACTION=0.3 SKIP_CHROMS=[MT, chrM] MALE_CHROMS=[chrY, Y] FEMALE_CHROMS=[chrX, X] PSEUDO_AUTOSOMAL_REGIONS=[chrX:10000-2781479, X:10001-2649520, chrX:155701382-156030895, X:59034050-59373566] THREAD_COUNT=1 TAB_MODE=false VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Fri Jun 21 20:30:15 CEST 2019] Executing as q005sc@T450s on Linux 4.15.0-51-generic amd64; OpenJDK 64-Bit Server VM 11.0.3+7-Ubuntu-1ubuntu218.04.1; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.20.2-SNAPSHOT
INFO    2019-06-21 20:30:15 FindMendelianViolations Loading and filtering trios.
WARNING 2019-06-21 20:30:15 FindMendelianViolations Removing trio due to the following missing samples in VCF: [0]
WARNING 2019-06-21 20:30:15 FindMendelianViolations Removing trio due to the following missing samples in VCF: [0]
WARNING 2019-06-21 20:30:15 FindMendelianViolations Removing trio due to the following missing samples in VCF: [0]
INFO    2019-06-21 20:30:16 FindMendelianViolations variants analyzed        10,000 records.  Elapsed time: 00:00:01s.  Time for last 10,000:    0s.  Last read position: chr1:62,594,480

[ ... omitted ... ]

INFO    2019-06-21 20:30:20 FindMendelianViolations variants analyzed       240,000 records.  Elapsed time: 00:00:05s.  Time for last 10,000:    0s.  Last read position: chr22:44,368,204
INFO    2019-06-21 20:30:20 FindMendelianViolations Writing family violation VCFs to /media/q005sc/WINDOWS/ngs_analysis/exome/2_analysis/recomb_TL/picard/vcf_violations30/
INFO    2019-06-21 20:30:20 FindMendelianViolations Writing 1 violation VCF to /media/q005sc/WINDOWS/ngs_analysis/exome/2_analysis/recomb_TL/picard/vcf_violations30/1.vcf
[Fri Jun 21 20:30:20 CEST 2019] picard.vcf.MendelianViolations.FindMendelianViolations done. Elapsed time: 0.09 minutes.
Runtime.totalMemory()=206569472
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" htsjdk.tribble.TribbleException$InvalidHeader: Your input file has a malformed header: BUG: VCF header has duplicate sample names
    at htsjdk.variant.vcf.VCFHeader.<init>(VCFHeader.java:142)
    at picard.vcf.MendelianViolations.FindMendelianViolations.writeAllViolations(FindMendelianViolations.java:288)
    at picard.vcf.MendelianViolations.FindMendelianViolations.doWork(FindMendelianViolations.java:262)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:295)
    at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
    at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:113)

However, the header seems fine to me (AXX to TXX are the six samples):

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  AXX01   EXX01   GXX01   NXX01   OXX01   TXX01
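
To double-check, one can list duplicated sample columns straight from the header (a sketch for a bgzipped VCF; it prints any sample name that occurs more than once):

zcat combine.snp.reheader-out.vcf.gz | grep -m1 '^#CHROM' | cut -f10- | tr '\t' '\n' | sort | uniq -d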

The input ped file looks like this. It does not contain all samples because we are interested only in generations 2 and 3, but the error also appears when all samples are included in the ped file:

1   OXX01   0  0  1  1
1   NXX01   0  0  2  0
1   TXX01   OXX01  NXX01  1  1
1   EXX01   0  0  2  0

The output of ValidateVariants is as follows (run from the docker image)

root@4362f1ecbb5a:/gatk# gatk --version
The Genome Analysis Toolkit (GATK) v4.1.2.0
HTSJDK Version: 2.19.0
Picard Version: 2.19.0
root@4362f1ecbb5a:/gatk# gatk ValidateVariants --variant combine.snp.reheader-out.vcf.gz 
Using GATK jar /gatk/gatk-package-4.1.2.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /gatk/gatk-package-4.1.2.0-local.jar ValidateVariants --variant combine.snp.reheader-out.vcf.gz
15:16:38.147 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk/gatk-package-4.1.2.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
Jun 22, 2019 3:16:39 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
15:16:39.895 INFO  ValidateVariants - ------------------------------------------------------------
15:16:39.896 INFO  ValidateVariants - The Genome Analysis Toolkit (GATK) v4.1.2.0
15:16:39.896 INFO  ValidateVariants - For support and documentation go to https://software.broadinstitute.org/gatk/
15:16:39.896 INFO  ValidateVariants - Executing as root@4362f1ecbb5a on Linux v4.15.0-51-generic amd64
15:16:39.897 INFO  ValidateVariants - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_191-8u191-b12-0ubuntu0.16.04.1-b12
15:16:39.897 INFO  ValidateVariants - Start Date/Time: June 22, 2019 3:16:38 PM UTC
15:16:39.897 INFO  ValidateVariants - ------------------------------------------------------------
15:16:39.897 INFO  ValidateVariants - ------------------------------------------------------------
15:16:39.897 INFO  ValidateVariants - HTSJDK Version: 2.19.0
15:16:39.897 INFO  ValidateVariants - Picard Version: 2.19.0
15:16:39.897 INFO  ValidateVariants - HTSJDK Defaults.COMPRESSION_LEVEL : 2
15:16:39.898 INFO  ValidateVariants - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
15:16:39.898 INFO  ValidateVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
15:16:39.898 INFO  ValidateVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
15:16:39.898 INFO  ValidateVariants - Deflater: IntelDeflater
15:16:39.898 INFO  ValidateVariants - Inflater: IntelInflater
15:16:39.898 INFO  ValidateVariants - GCS max retries/reopens: 20
15:16:39.898 INFO  ValidateVariants - Requester pays: disabled
15:16:39.898 INFO  ValidateVariants - Initializing engine
15:16:40.150 INFO  FeatureManager - Using codec VCFCodec to read file file:///gatk/combine.snp.reheader-out.vcf.gz
15:16:40.268 INFO  ValidateVariants - Done initializing engine
15:16:40.269 INFO  ProgressMeter - Starting traversal
15:16:40.269 INFO  ProgressMeter -        Current Locus  Elapsed Minutes    Variants Processed  Variants/Minute
15:16:41.899 INFO  ProgressMeter -       chrX:142605437              0.0                245817        9048478.5
15:16:41.900 INFO  ProgressMeter - Traversal complete. Processed 245817 total variants in 0.0 minutes.
15:16:41.900 INFO  ValidateVariants - Shutting down engine

I was not able to tell from the output above whether my vcf is OK or not. No report file was written to the directory (executed in /gatk).

I would be very grateful for any help to figure out what is happening! Thank you very much!

Stefan
