Hi,
I am interested in finding de novo mutations in human families. I have implemented the GATK Best Practices data pre-processing pipeline and the germline short variant discovery workflow, using GATK4 and the hg38 reference genome. I plan to implement the Genotype Refinement workflow for germline short variants (as this seems to be the GATK-suggested way of identifying de novo mutations), but I wanted to try PhaseByTransmission (PBT) before proceeding. To do this I needed GATK3, as PBT is not implemented in GATK4. At first I ran into a problem because my VCF contains variants from an entire family (5 individuals) and PBT can only run on a single trio at a time. So I wrote an individual PED file for each child and ran PBT with the '--pedigreeValidationType SILENT' option. Here is what I ran for one of the children (I ran something very similar for the other 2):
java -jar GenomeAnalysisTK.jar \
-T PhaseByTransmission \
-R $ref_dir/Homo_sapiens_assembly38.fasta \
-V 2003_57_recalibrated_variants.vcf \
-ped 2003058.ped \
-o 2003058_pbt.vcf \
-mvf 2003058_mandelian_violations.vcf \
--pedigreeValidationType SILENT 2
And here are the contents of 2003058.ped:
2003_57 2003003 0 0 2 0
2003_57 2003057 0 0 1 0
2003_57 2003058 2003057 2003003 2 0
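For reference, the per-child PED files follow the usual six-column layout (family ID, individual ID, father ID, mother ID, sex, phenotype). The following is only a minimal Python sketch of how one such trio file could be produced; the function names `trio_ped_lines` and `write_trio_ped` are hypothetical helpers, not part of GATK, and the IDs in the usage comment are the ones shown above.

```python
def trio_ped_lines(family_id, mother, father, child, child_sex):
    """Return the three PED rows for a single trio.

    Columns: family ID, individual ID, father ID, mother ID,
    sex (1 = male, 2 = female), phenotype (0 = unknown).
    """
    return [
        f"{family_id} {mother} 0 0 2 0",
        f"{family_id} {father} 0 0 1 0",
        f"{family_id} {child} {father} {mother} {child_sex} 0",
    ]


def write_trio_ped(path, family_id, mother, father, child, child_sex):
    """Write one trio to a PED file (hypothetical helper)."""
    with open(path, "w") as handle:
        for row in trio_ped_lines(family_id, mother, father, child, child_sex):
            handle.write(row + "\n")


# Example, using the IDs from the PED file shown above:
# write_trio_ped("2003058.ped", "2003_57", "2003003", "2003057", "2003058", 2)
```

Calling the helper once per child gives one PED file per trio, which is what PBT needed here.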
For each of these runs, PBT crashed after about 10 minutes. Here is the tail of the output:
INFO 15:05:18,853 ProgressMeter - chr13:108037823 8021595.0 9.0 m 67.0 s 67.9% 13.3 m 4.3 m
INFO 15:05:48,854 ProgressMeter - chr15:46251520 8472743.0 9.5 m 67.0 s 72.9% 13.0 m 3.5 m
INFO 15:06:18,855 ProgressMeter - chr16:71683502 8933283.0 10.0 m 67.0 s 76.8% 13.0 m 3.0 m
INFO 15:06:48,857 ProgressMeter - chr18:10072380 9399718.0 10.5 m 67.0 s 80.3% 13.1 m 2.6 m
INFO 15:07:18,859 ProgressMeter - chr19:45951638 9858096.0 11.0 m 66.0 s 83.9% 13.1 m 2.1 m
INFO 15:07:48,860 ProgressMeter - chr21:27088297 1.0311184E7 11.5 m 66.0 s 87.2% 13.2 m 101.0 s
##### ERROR --
##### ERROR stack trace
java.lang.ArrayIndexOutOfBoundsException: 2
at htsjdk.variant.variantcontext.GenotypeLikelihoods.getAsMap(GenotypeLikelihoods.java:171)
at org.broadinstitute.gatk.tools.walkers.phasing.PhaseByTransmission.getLikelihoodsAsMapSafeNull(PhaseByTransmission.java:625)
at org.broadinstitute.gatk.tools.walkers.phasing.PhaseByTransmission.phaseTrioGenotypes(PhaseByTransmission.java:669)
at org.broadinstitute.gatk.tools.walkers.phasing.PhaseByTransmission.map(PhaseByTransmission.java:878)
at org.broadinstitute.gatk.tools.walkers.phasing.PhaseByTransmission.map(PhaseByTransmission.java:143)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48)
at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:98)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:323)
at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:123)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:256)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:158)
at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:108)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 3.8-1-0-gf15c1c3ef):
##### ERROR
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions https://software.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: 2
##### ERROR ------------------------------------------------------------------------------------------
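The top frame of the stack trace is GenotypeLikelihoods.getAsMap and the message is an out-of-bounds index of 2. One possible explanation, which is an assumption on my part and not something I have confirmed, is that PBT reached a genotype whose PL field does not hold exactly three values (for example a multi-allelic or non-diploid call). A rough Python sketch for listing such records in an uncompressed VCF could look like this; `records_with_unexpected_pl` is a hypothetical helper name and the parsing is deliberately simple.

```python
def records_with_unexpected_pl(lines, expected=3):
    """Yield (chrom, pos, sample, n_values) for every sample whose PL
    field does not contain `expected` comma-separated values.

    `lines` is any iterable of VCF text lines (header included).
    Samples with a missing PL are skipped.
    """
    samples = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            continue
        if len(fields) < 10:
            continue  # no genotype columns on this record
        keys = fields[8].split(":")
        if "PL" not in keys:
            continue
        pl_index = keys.index("PL")
        for sample, call in zip(samples, fields[9:]):
            values = call.split(":")
            # Trailing FORMAT fields may be dropped, and PL may be "."
            if pl_index >= len(values) or values[pl_index] in (".", ""):
                continue
            count = len(values[pl_index].split(","))
            if count != expected:
                yield fields[0], int(fields[1]), sample, count


# Example (assumes a plain-text, uncompressed VCF):
# with open("2003_57_recalibrated_variants.vcf") as handle:
#     for hit in records_with_unexpected_pl(handle):
#         print(*hit, sep="\t")
```

If the reported records cluster just after the last position shown by the ProgressMeter, that would support the assumption; if nothing is reported, the cause is something else.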
I saw threads on similar error messages, so I used ValidateVariants to make sure the VCF file produced by the GATK pipeline was OK. GATK4's ValidateVariants said it was fine. However, GATK3's ValidateVariants produced the following output:
INFO 15:49:26,538 ValidateVariants - Reference allele is too long (133) at position chr2:3725401; skipping that record. Set --reference_window_stop >= 133
INFO 15:49:26,818 ValidateVariants - Reference allele is too long (111) at position chr2:8067973; skipping that record. Set --reference_window_stop >= 111
INFO 15:49:26,857 ValidateVariants - Reference allele is too long (120) at position chr2:8895476; skipping that record. Set --reference_window_stop >= 120
INFO 15:49:26,884 ValidateVariants - Reference allele is too long (113) at position chr2:9406449; skipping that record. Set --reference_window_stop >= 113
INFO 15:49:27,010 ValidateVariants - Reference allele is too long (113) at position chr2:10925438; skipping that record. Set --reference_window_stop >= 113
INFO 15:49:27,105 ValidateVariants - Reference allele is too long (108) at position chr2:12456149; skipping that record. Set --reference_window_stop >= 108
INFO 15:49:27,402 ValidateVariants - Reference allele is too long (187) at position chr2:17964404; skipping that record. Set --reference_window_stop >= 187
INFO 15:49:27,428 ValidateVariants - Reference allele is too long (122) at position chr2:18294631; skipping that record. Set --reference_window_stop >= 122
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 3.8-1-0-gf15c1c3ef):
##### ERROR
##### ERROR This means that one or more arguments or inputs in your command are incorrect.
##### ERROR The error message below tells you what is the problem.
##### ERROR
##### ERROR If the problem is an invalid argument, please check the online documentation guide
##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
##### ERROR
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions https://software.broadinstitute.org/gatk
##### ERROR
##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
##### ERROR
##### ERROR MESSAGE: File /home/aschoenr/work/data/8_rvqs/2003_57/2003_57_recalibrated_variants.vcf fails strict validation: one or more of the ALT allele(s) for the record at position chr2:20357491 are not observed at all in the sample genotypes
##### ERROR ------------------------------------------------------------------------------------------
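To see how widespread the condition in that last message is, the records with an ALT allele that no sample genotype carries can be listed directly. This is only a minimal Python sketch that mirrors the wording of the error message, not GATK's actual validation logic; `records_with_unobserved_alt` is a hypothetical helper name, and it assumes an uncompressed VCF with GT present in the FORMAT column.

```python
import re


def records_with_unobserved_alt(lines):
    """Yield (chrom, pos, unobserved_alts) for records in which at least
    one ALT allele is not called in any sample's GT field.

    `lines` is any iterable of VCF text lines (header included).
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 10 or fields[4] == ".":
            continue  # no genotypes, or no ALT allele at all
        alts = fields[4].split(",")
        keys = fields[8].split(":")
        if "GT" not in keys:
            continue
        gt_index = keys.index("GT")
        seen = set()
        for call in fields[9:]:
            values = call.split(":")
            if gt_index >= len(values):
                continue
            # GT alleles are separated by "/" (unphased) or "|" (phased)
            for allele in re.split(r"[/|]", values[gt_index]):
                if allele.isdigit():  # skips "." for missing alleles
                    seen.add(int(allele))
        # ALT alleles are numbered from 1; 0 is the reference allele
        unobserved = [alt for i, alt in enumerate(alts, start=1) if i not in seen]
        if unobserved:
            yield fields[0], int(fields[1]), unobserved


# Example (assumes a plain-text, uncompressed VCF):
# with open("2003_57_recalibrated_variants.vcf") as handle:
#     for chrom, pos, alts in records_with_unobserved_alt(handle):
#         print(chrom, pos, ",".join(alts), sep="\t")
```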
Does this mean that the VCF created by the GATK4 pipeline will not work with PhaseByTransmission? As I said above, I plan on implementing the Genotype Refinement workflow, but I thought it would be useful to have the PBT output to compare against the Genotype Refinement workflow output.
Any help would be greatly appreciated!