Channel: Recent Discussions — GATK-Forum

Large Batch - "Unable to create BasicFeatureReader"


Hi,
I've been running into an error using GATK 3.8-0 ("Unable to create BasicFeatureReader using feature file , for input source: ") when I try a somewhat large batch call of bgzipped, tabix-indexed single-chromosome GVCFs. (Ultimately I need to batch-call ~1000 samples together.) I can't find an intrinsic problem with the input files: I am able to batch-call the entire set in groups of ten or groups of 100, but if I try 200 files or more, I get this error. However, I don't understand why this particular error would be related to a lack of resources, and I can't otherwise tell from our job queuing system (LSF) that our resource limits were exceeded.
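For what it's worth, the check I used to rule out per-file corruption was along these lines (a sketch; the function name and the `INPUT_LIST` driver are placeholders, and `tabix` is the HTSlib tabix):

```shell
# check_gvcf FILE: report a bgzipped GVCF whose tabix index can't be opened.
# "tabix -l" lists the indexed sequence names and exits non-zero when the
# .tbi is missing or the file is not valid BGZF.
check_gvcf() {
    if ! tabix -l "$1" > /dev/null 2>&1; then
        echo "PROBLEM: $1"
    fi
}

# Hypothetical driver over a list file with one .g.vcf.gz path per line:
# while read -r gvcf; do check_gvcf "$gvcf"; done < INPUT_LIST
```

Every file passes this check individually, which is why the batch-size dependence is confusing.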

Any advice would be appreciated. Thank you!

I'm giving the jobs:
32 processors
~256 GB RAM
Java heap size: 212 GB
HTSlib version 1.4.1

and calling GATK with this command (after setting some environment variables, which I've excluded here):

CMD="java -Xmx212g -jar GenomeAnalysisTK.jar \
    -T GenotypeGVCFs \
    -R ${BWA_GENOME_DIR}/${GENOME}.fa \
    -nt 32 \
    -L $CHR \
    ${inputFileList} \
    -o ${CHR}_${1}_$(date +%s%N).vcf"
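(`${inputFileList}` is assembled from the list file passed to the script as `$1`; roughly like this, as a sketch, with the helper name `build_v_args` being my own:)

```shell
# build_v_args LISTFILE: turn a file containing one GVCF path per line
# into the "-V path1 -V path2 ..." argument string for GenotypeGVCFs.
build_v_args() {
    local args=""
    while read -r gvcf; do
        args="$args -V $gvcf"
    done < "$1"
    echo "$args"
}

# inputFileList=$(build_v_args "$1")
```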

Here is the GATK output of one of the failures:

Sender: LSF System <lsfadmin@hpc0006>
Subject: Job 71822: <sh run_big_tabix_input.sh 2hundredset00.txt> in cluster <helion-poc> Exited

Job <sh run_big_tabix_input.sh 2hundredset00.txt> was submitted from host <login01> by user <jlawlor> in cluster <helion-poc>.
Job was executed on host(s) <32*hpc0006>, in queue <c7normal>, as user <jlawlor> in cluster <helion-poc>.
</gpfs/gpfs1/home/jlawlor> was used as the home directory.
</gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call> was used as the working directory.
Started at Wed Aug 16 06:55:28 2017
Results reported on Wed Aug 16 06:56:06 2017

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
sh run_big_tabix_input.sh 2hundredset00.txt
------------------------------------------------------------

Exited with exit code 1.

Resource usage summary:

    CPU time :                                   560.00 sec.
    Max Memory :                                 25028 MB
    Average Memory :                             11828.50 MB
    Total Requested Memory :                     260000.00 MB
    Delta Memory :                               234972.00 MB
    Max Processes :                              4
    Max Threads :                                107

The output (if any) follows:

INFO  06:55:30,757 HelpFormatter - ----------------------------------------------------------------------------------
INFO  06:55:30,759 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.8-0-ge9d806836, Compiled 2017/07/28 21:26:50
INFO  06:55:30,759 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute
INFO  06:55:30,759 HelpFormatter - For support and documentation go to https://software.broadinstitute.org/gatk
INFO  06:55:30,766 HelpFormatter - [Wed Aug 16 06:55:30 CDT 2017] Executing on Linux 3.10.0-327.3.1.el7.x86_64 amd64
INFO  06:55:30,767 HelpFormatter - OpenJDK 64-Bit Server VM 1.8.0_65-b17
INFO  06:55:30,770 HelpFormatter - Program Args: -T GenotypeGVCFs -R /gpfs/gpfs1/myerslab/reference/genomes/bwa-0.7.8/GRCh37.fa -nt 32 -L 1 -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115394_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115395_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115396_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115397_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115398_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115399_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115400_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115401_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115402_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115403_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115404_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115405_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115415_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115416_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115417_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115418_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115419_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115420_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115421_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115422_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115423_1.g.vcf.gz -V 
/gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115424_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115425_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115426_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115427_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115428_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115429_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115430_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115431_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115432_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115433_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115434_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115435_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115436_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115437_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115438_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115439_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115440_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115441_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115442_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115443_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115444_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115445_1.g.vcf.gz -V 
/gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115446_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115447_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115448_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115449_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115450_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115451_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115452_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115453_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115454_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115455_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115456_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115457_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115458_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115459_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115460_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115461_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115462_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115463_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115464_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115465_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115466_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115467_1.g.vcf.gz -V 
/gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115468_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115469_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115470_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115471_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115472_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115473_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115474_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115475_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115476_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115477_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115478_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115479_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115480_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115481_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115482_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115483_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115484_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115485_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115486_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115487_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115488_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115489_1.g.vcf.gz -V 
/gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115490_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115491_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115492_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115493_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115494_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115495_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115496_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115497_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115498_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115499_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115500_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115501_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115502_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115503_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115504_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115505_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115506_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115507_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115508_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115509_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115510_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL122119_1.g.vcf.gz -V 
/gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL122120_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL122121_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL122122_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL122123_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL122124_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL122125_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL122126_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL122127_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL122128_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL122129_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL122130_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL122131_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL122132_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL122133_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL122134_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL122135_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL122136_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL122137_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL122138_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL122139_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL122140_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL122141_1.g.vcf.gz -V 
/gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL122142_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL122143_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL122144_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL122145_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL122146_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL122147_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL122148_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL122149_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL122150_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL122151_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL122152_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL122153_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL122154_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL122155_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL122156_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL122157_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL122158_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL122159_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL122160_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL122161_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL122162_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL122163_1.g.vcf.gz -V 
/gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL122164_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL122165_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL122166_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL122167_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL122168_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL122169_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL122170_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL122171_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL122172_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL122173_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL122174_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL122175_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL122176_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL122177_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL126155_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL126156_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL126157_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL126158_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL126159_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL126160_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL126161_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL126162_1.g.vcf.gz -V 
/gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL126163_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL126164_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL126165_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL126166_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL126167_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL126168_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL126169_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL126170_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL126171_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL126172_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL126173_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL126174_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL126175_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL126176_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL126177_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL126178_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL126179_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL126180_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL126181_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL126182_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL126183_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL126184_1.g.vcf.gz -V 
/gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL126185_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL126186_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL126187_1.g.vcf.gz -o 1_2hundredset00.txt_1502884529140524240.vcf
INFO  06:55:30,777 HelpFormatter - Executing as jlawlor@hpc0006 on Linux 3.10.0-327.3.1.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_65-b17.
INFO  06:55:30,777 HelpFormatter - Date/Time: 2017/08/16 06:55:30
INFO  06:55:30,778 HelpFormatter - ----------------------------------------------------------------------------------
INFO  06:55:30,778 HelpFormatter - ----------------------------------------------------------------------------------
ERROR StatusLogger Unable to create class org.apache.logging.log4j.core.impl.Log4jContextFactory specified in jar:file:/gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/GenomeAnalysisTK.jar!/META-INF/log4j-provider.properties
ERROR StatusLogger Log4j2 could not find a logging implementation. Please add log4j-core to the classpath. Using SimpleLogger to log to the console...
INFO  06:55:36,737 GenomeAnalysisEngine - Deflater: IntelDeflater
INFO  06:55:36,737 GenomeAnalysisEngine - Inflater: IntelInflater
INFO  06:55:36,738 GenomeAnalysisEngine - Strictness is SILENT
INFO  06:55:36,894 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO  06:55:45,190 IntervalUtils - Processing 249250621 bp from intervals
WARN  06:55:45,191 IndexDictionaryUtils - Track variant doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  06:55:45,191 IndexDictionaryUtils - Track variant2 doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  06:55:45,192 IndexDictionaryUtils - Track variant3 doesn't have a sequence dictionary built in, skipping dictionary validation
[ Removed for length ]
WARN  06:55:45,223 IndexDictionaryUtils - Track variant199 doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  06:55:45,223 IndexDictionaryUtils - Track variant200 doesn't have a sequence dictionary built in, skipping dictionary validation
INFO  06:55:45,232 MicroScheduler - Running the GATK in parallel mode with 32 total threads, 1 CPU thread(s) for each of 32 data thread(s), of 64 processors available on this machine
INFO  06:55:45,281 GenomeAnalysisEngine - Preparing for traversal
INFO  06:55:45,283 GenomeAnalysisEngine - Done preparing for traversal
INFO  06:55:45,284 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO  06:55:45,285 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining
INFO  06:55:45,285 ProgressMeter -        Location |     sites | elapsed |     sites | completed | runtime |   runtime
WARN  06:55:45,928 StrandBiasTest - StrandBiasBySample annotation exists in input VCF header. Attempting to use StrandBiasBySample values to calculate strand bias annotation values. If no sample has the SB genotype annotation, annotation may still fail.
WARN  06:55:45,929 StrandBiasTest - StrandBiasBySample annotation exists in input VCF header. Attempting to use StrandBiasBySample values to calculate strand bias annotation values. If no sample has the SB genotype annotation, annotation may still fail.
INFO  06:55:45,929 GenotypeGVCFs - Notice that the -ploidy parameter is ignored in GenotypeGVCFs tool as this is automatically determined by the input variant files
WARN  06:55:51,116 HaplotypeScore - Annotation will not be calculated, must be called from UnifiedGenotyper, not GenotypeGVCFs
WARN  06:55:53,111 ExactAFCalculator - This tool is currently set to genotype at most 6 alternate alleles in a given context, but the context at 1: 10445 has 7 alternate alleles so only the top alleles will be used; see the --max_alternate_alleles argument. Unless the DEBUG logging level is used, this warning message is output just once per run and further warnings are suppressed.
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 3.8-0-ge9d806836):
##### ERROR
##### ERROR This means that one or more arguments or inputs in your command are incorrect.
##### ERROR The error message below tells you what is the problem.
##### ERROR
##### ERROR If the problem is an invalid argument, please check the online documentation guide
##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
##### ERROR
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions https://software.broadinstitute.org/gatk
##### ERROR
##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
##### ERROR
##### ERROR MESSAGE: Unable to create BasicFeatureReader using feature file , for input source: /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115506_1.g.vcf.gz
##### ERROR ------------------------------------------------------------------------------------------

In contrast, here is the output of a successful batch call with fewer samples:

Sender: LSF System <lsfadmin@hpc0006>
Subject: Job 71700: <sh run_big_tabix_input.sh x00.txt> in cluster <helion-poc> Done

Job <sh run_big_tabix_input.sh x00.txt> was submitted from host <login01> by user <jlawlor> in cluster <helion-poc>.
Job was executed on host(s) <32*hpc0006>, in queue <c7normal>, as user <jlawlor> in cluster <helion-poc>.
</gpfs/gpfs1/home/jlawlor> was used as the home directory.
</gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call> was used as the working directory.
Started at Tue Aug 15 15:45:26 2017
Results reported on Tue Aug 15 15:51:16 2017

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
sh run_big_tabix_input.sh x00.txt
------------------------------------------------------------

Successfully completed.

Resource usage summary:

    CPU time :                                   10318.77 sec.
    Max Memory :                                 72817 MB
    Average Memory :                             64894.37 MB
    Total Requested Memory :                     260000.00 MB
    Delta Memory :                               187183.00 MB
    Max Processes :                              4
    Max Threads :                                107

The output (if any) follows:

INFO  15:45:27,917 HelpFormatter - ----------------------------------------------------------------------------------
INFO  15:45:27,919 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.8-0-ge9d806836, Compiled 2017/07/28 21:26:50
INFO  15:45:27,919 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute
INFO  15:45:27,919 HelpFormatter - For support and documentation go to https://software.broadinstitute.org/gatk
INFO  15:45:27,921 HelpFormatter - [Tue Aug 15 15:45:27 CDT 2017] Executing on Linux 3.10.0-327.3.1.el7.x86_64 amd64
INFO  15:45:27,921 HelpFormatter - OpenJDK 64-Bit Server VM 1.8.0_65-b17
INFO  15:45:27,924 HelpFormatter - Program Args: -T GenotypeGVCFs -R /gpfs/gpfs1/myerslab/reference/genomes/bwa-0.7.8/GRCh37.fa -nt 32 -L 1 -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115394_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115395_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115396_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115397_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115398_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115399_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115400_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115401_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115402_1.g.vcf.gz -V /gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/bgzipped/SL115403_1.g.vcf.gz -o 1_x00.txt_1502829926656725768.vcf
INFO  15:45:27,929 HelpFormatter - Executing as jlawlor@hpc0006 on Linux 3.10.0-327.3.1.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_65-b17.
INFO  15:45:27,930 HelpFormatter - Date/Time: 2017/08/15 15:45:27
INFO  15:45:27,930 HelpFormatter - ----------------------------------------------------------------------------------
INFO  15:45:27,930 HelpFormatter - ----------------------------------------------------------------------------------
ERROR StatusLogger Unable to create class org.apache.logging.log4j.core.impl.Log4jContextFactory specified in jar:file:/gpfs/gpfs2/cooperlab/test_batch/inadvisable_batch_call/GenomeAnalysisTK.jar!/META-INF/log4j-provider.properties
ERROR StatusLogger Log4j2 could not find a logging implementation. Please add log4j-core to the classpath. Using SimpleLogger to log to the console...
INFO  15:45:28,461 GenomeAnalysisEngine - Deflater: IntelDeflater
INFO  15:45:28,461 GenomeAnalysisEngine - Inflater: IntelInflater
INFO  15:45:28,462 GenomeAnalysisEngine - Strictness is SILENT
INFO  15:45:28,621 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO  15:45:28,940 IntervalUtils - Processing 249250621 bp from intervals
WARN  15:45:28,941 IndexDictionaryUtils - Track variant doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  15:45:28,941 IndexDictionaryUtils - Track variant2 doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  15:45:28,941 IndexDictionaryUtils - Track variant3 doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  15:45:28,942 IndexDictionaryUtils - Track variant4 doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  15:45:28,942 IndexDictionaryUtils - Track variant5 doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  15:45:28,942 IndexDictionaryUtils - Track variant6 doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  15:45:28,942 IndexDictionaryUtils - Track variant7 doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  15:45:28,942 IndexDictionaryUtils - Track variant8 doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  15:45:28,942 IndexDictionaryUtils - Track variant9 doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  15:45:28,942 IndexDictionaryUtils - Track variant10 doesn't have a sequence dictionary built in, skipping dictionary validation
INFO  15:45:28,948 MicroScheduler - Running the GATK in parallel mode with 32 total threads, 1 CPU thread(s) for each of 32 data thread(s), of 64 processors available on this machine
INFO  15:45:28,992 GenomeAnalysisEngine - Preparing for traversal
INFO  15:45:28,994 GenomeAnalysisEngine - Done preparing for traversal
INFO  15:45:28,995 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO  15:45:28,996 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining
INFO  15:45:28,996 ProgressMeter -        Location |     sites | elapsed |     sites | completed | runtime |   runtime
WARN  15:45:29,095 StrandBiasTest - StrandBiasBySample annotation exists in input VCF header. Attempting to use StrandBiasBySample values to calculate strand bias annotation values. If no sample has the SB genotype annotation, annotation may still fail.
WARN  15:45:29,096 StrandBiasTest - StrandBiasBySample annotation exists in input VCF header. Attempting to use StrandBiasBySample values to calculate strand bias annotation values. If no sample has the SB genotype annotation, annotation may still fail.
INFO  15:45:29,096 GenotypeGVCFs - Notice that the -ploidy parameter is ignored in GenotypeGVCFs tool as this is automatically determined by the input variant files
WARN  15:45:29,675 HaplotypeScore - Annotation will not be calculated, must be called from UnifiedGenotyper, not GenotypeGVCFs
WARN  15:45:31,082 ExactAFCalculator - This tool is currently set to genotype at most 6 alternate alleles in a given context, but the context at 1: 12004479 has 8 alternate alleles so only the top alleles will be used; see the --max_alternate_alleles argument. Unless the DEBUG logging level is used, this warning message is output just once per run and further warnings are suppressed.
INFO  15:45:59,000 ProgressMeter -      1:32054701   1000000.0    30.0 s      30.0 s       12.9%     3.9 m       3.4 m
[Removed for length]
INFO  15:51:15,369 ProgressMeter - Total runtime 346.37 secs, 5.77 min, 0.10 hours
------------------------------------------------------------------------------------------
Done. There were 14 WARN messages, the first 10 are repeated below.
...

How can I get the common variants of three samples from a multi-sample VCF after joint genotyping?


Hi. I'm studying sequencing data analysis following the GATK Best Practices for Germline SNP & Indel Discovery. After the series of analysis steps, I finally get a multi-sample VCF file from joint genotyping. From this, I want to extract the variants common to just three samples. I've run SelectVariants with the --sample_name argument and obtained a VCF containing those three samples, but it does not contain only those three samples' variants; it still contains every variant site from the multi-sample VCF. So I want to ask whether there is a method to get the common variants of three samples from a multi-sample VCF. Thank you
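One approach (a sketch, not an official GATK recommendation): after SelectVariants with --sample_name, adding --excludeNonVariants (-env) removes sites where all retained samples are homozygous-reference, but keeping only sites that are variant in all three samples still needs a small post-filter. A toy illustration of that filter on an invented, whitespace-delimited mini-VCF (sample names S1-S3 are made up):

```shell
# Keep only sites where every sample genotype carries at least one ALT
# allele, i.e. the variant is shared by all three samples.
cat > trio.vcf <<'EOF'
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S1 S2 S3
1 100 . A G 50 PASS . GT 0/1 1/1 0/1
1 200 . C T 50 PASS . GT 0/0 0/1 0/1
1 300 . G A 50 PASS . GT 1/1 0/1 1/1
EOF
awk '/^#/ { print; next }
  {
    keep = 1
    for (i = 10; i <= NF; i++) {            # sample columns start at $10
      split($i, gt, /[\/|:]/)               # GT is the first FORMAT field here
      if (gt[1] !~ /[1-9]/ && gt[2] !~ /[1-9]/) keep = 0
    }
    if (keep) print
  }' trio.vcf > common.vcf
COMMON=$(grep -vc '^#' common.vcf)
echo "sites shared by all samples: $COMMON"
```

On a real multi-sample VCF the same per-line idea applies; site 200 above is dropped because S1 is 0/0.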

Questions about BQSR (2014-2016)

Well-calibrated likelihood that a variant is truly heterozygous rather than homozygous?


Hi,

I have whole genome sequencing data, and I am trying to assemble a list of sites which are heterozygous with high confidence.

VariantRecalibrator seems to estimate two distributions: a (0/0) distribution and a (0/1 ∪ 1/1) distribution. So, I can quantify my confidence that a site is not 0/0.

However, I don't see any way to quantify confidence that a site is 0/1 as opposed to 1/1. I have not found any information about this in the documentation page for the tool, the FAQ page on setting its options, the tutorial, the forums, or the DePristo et al 2011 paper.

So, I have two questions:

  1. Is it possible to calculate a well-calibrated likelihood ratio between the 0/1 and 1/1 possibilities, using VariantRecalibrator or another tool?

  2. How do people generally go about performing tasks like mine, where they are trying to make a list of specifically heterozygous sites, excluding homozygous sites?
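For context on question 1: at a biallelic site, the PL field written per genotype already contains Phred-scaled genotype likelihoods in the order 0/0, 0/1, 1/1, so a raw likelihood ratio between 0/1 and 1/1 can be read off directly; whether that ratio is well calibrated is exactly the open question here. A sketch with invented PL values:

```shell
# PL = Phred-scaled, normalized genotype likelihoods (order 0/0,0/1,1/1
# at a biallelic site). The het-vs-hom-alt likelihood ratio is then
# 10^((PL_homalt - PL_het)/10). These PL values are invented.
PL="60,0,20"
RATIO=$(echo "$PL" | awk -F, '{ printf "%.0f", 10 ^ (($3 - $2) / 10) }')
echo "L(0/1) / L(1/1) = $RATIO"
```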

Thanks,

Alex

CalculateGenotypePosteriors error


Hi,

I've run the GATK best practices pipeline up through VQSR and have the recalibrated variants in a VCF file. Because I'm analyzing pedigree samples, I'm now attempting to run CalculateGenotypePosteriors. When I do this, I get the following error:

ERROR MESSAGE: Variant does not contain the same number of MLE allele counts as alternate alleles for record at 1:768589

However, when I look at that variant in the recalibrated VCF file, I see that MLEAC=1 and AC=1, and I can confirm that there is only one sample that is heterozygous, so both AC and MLEAC are correct.

Might this be a bug?

Thanks,
Amy

(howto) Run the genotype refinement workflow


Overview

This tutorial provides step-by-step instructions for applying the Genotype Refinement workflow (described in this method article) to your data.


Step 1: Derive posterior probabilities of genotypes

In this first step, we are deriving the posteriors of genotype calls in our callset, recalibratedVariants.vcf, which just came out of the VQSR filtering step; it contains among other samples a trio of individuals (mother, father and child) whose family structure is described in the pedigree file trio.ped (which you need to supply). To do this, we are using the most comprehensive set of high confidence SNPs available to us, a set of sites from Phase 3 of the 1000 Genomes project (available in our resource bundle), which we pass via the --supporting argument.

 java -jar GenomeAnalysisTK.jar -R human_g1k_v37_decoy.fasta -T CalculateGenotypePosteriors --supporting 1000G_phase3_v4_20130502.sites.vcf -ped trio.ped -V recalibratedVariants.vcf -o recalibratedVariants.postCGP.vcf

This produces the output file recalibratedVariants.postCGP.vcf, in which the posteriors have been annotated wherever possible.


Step 2: Filter low quality genotypes

In this second, very simple step, we are tagging low quality genotypes so we know not to use them in our downstream analyses. We use GQ 20 as the quality threshold, which means that any passing genotype has at least a 99% chance of being correct.

java -jar $GATKjar -T VariantFiltration -R $bundlePath/b37/human_g1k_v37_decoy.fasta -V recalibratedVariants.postCGP.vcf -G_filter "GQ < 20.0" -G_filterName lowGQ -o recalibratedVariants.postCGP.Gfiltered.vcf

Note that in the resulting VCF, the genotypes that failed the filter are still present, but they are tagged lowGQ with the FT tag of the FORMAT field.
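As a sanity check on the GQ 20 threshold used above: GQ is Phred-scaled, so the probability that a genotype call is wrong is 10^(-GQ/10):

```shell
# Phred scale: a genotype with GQ 20 has a 10^(-20/10) = 1% chance of
# being wrong, hence the "99% chance of being correct" statement.
GQ=20
PERR=$(awk -v gq="$GQ" 'BEGIN { printf "%.2f", 10 ^ (-gq / 10) }')
echo "P(wrong genotype) at GQ $GQ = $PERR"
```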


Step 3: Annotate possible de novo mutations

In this third and final step, we tag variants for which at least one family in the callset shows evidence of a de novo mutation based on the genotypes of the family members.

java -jar $GATKjar -T VariantAnnotator -R $bundlePath/b37/human_g1k_v37_decoy.fasta -V recalibratedVariants.postCGP.Gfiltered.vcf -A PossibleDeNovo -ped trio.ped -o recalibratedVariants.postCGP.Gfiltered.deNovos.vcf

The annotation output will include a list of the children with possible de novo mutations, classified as either high or low confidence.

See section 3 of the method article for a complete description of annotation outputs and section 4 for an example of a call and the interpretation of the annotation values.

GATK4 beta2 GenotypeGVCFs produces VCF with no records, just a header


Hi,

GATK4 beta2 GenotypeGVCFs produces a VCF with no records on my test data. The file does contain a valid VCF header.

The commands that I used for GenomicsDBImport and GenotypeGVCFs are below. Both were run on the GATK4 beta2 jar.

gatk-launch --javaOptions '-Xms500m -Xmx28665m -XX:+UseSerialGC -Djava.io.tmpdir=/data/run/tmpKVSMhn' GenomicsDBImport -V DA_123_01.vcf.gz -V DA_123_02.vcf.gz -V DA_123_03.vcf.gz -V DA_123_04.vcf.gz -V DA_123_05.vcf.gz -V DA_123_06.vcf.gz -V DA_123_07.vcf.gz -V DA_123_08.vcf.gz --genomicsDBWorkspace DEV_1066_Chr_01 --intervals Chr_01
gatk-launch --javaOptions '-Xms500m -Xmx28665m -XX:+UseSerialGC -Djava.io.tmpdir=/data/run/tmpKVSMhn' GenotypeGVCFs -R ./my_species.fa -V gendb://DEV_1066_Chr_01 -G StandardAnnotation -newQual -O DEV_1066_Chr_01.vcf.gz

The output of both tools does not show anything strange. The GenotypeGVCFs tool finishes with these statements:

17:07:24.554 INFO  ProgressMeter -      Chr_01:38007262             12.4              17243000        1394831.5
17:07:34.561 INFO  ProgressMeter -      Chr_01:38743870             12.5              17590000        1403959.7
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),500.6120602850972,Cpu time(s),450.9598250043053
17:07:37.965 INFO  ProgressMeter -      Chr_01:39060683             12.6              17742934        1409782.4
17:07:37.965 INFO  ProgressMeter - Traversal complete. Processed 17742934 total variants in 12.6 minutes.
17:07:37.975 WARN  IntelDeflaterFactory - IntelDeflater is not supported, using Java.util.zip.Deflater
17:07:37.976 INFO  GenotypeGVCFs - Shutting down engine
[July 20, 2017 5:07:37 PM CEST] org.broadinstitute.hellbender.tools.walkers.GenotypeGVCFs done. Elapsed time: 12.61 minutes.
Runtime.totalMemory()=506986496

The gVCF files were produced with GATK4 beta1. Is there a way that this issue can be solved or that I can further debug this issue?

Thank you.

CombineVariants fail when combining vcf from genotypeGVCFs genotyped using -allSites flag


When I attempt to run the CombineVariants tool with two VCF files generated by GenotypeGVCFs using the -allSites flag, I get an error saying that I should not use CombineVariants on a gVCF file. But my files are not gVCFs.

Error:

ERROR MESSAGE: CombineVariants should not be used to merge gVCFs produced by the HaplotypeCaller; use CombineGVCFs instead

Command used:
java -jar GenomeAnalysisTK.jar
-T CombineVariants
-R /path_to_reference/REFERENCE.fa
--variant /path_to_genotypeGVCFs_output/species1_allsites.vcf
--variant /path_to_genotypeGVCFs_output/species2_allsites.vcf
-o allcombined.vcf
-genotypeMergeOptions UNIQUIFY

gatk version v3.7-0

If I run it on the VCFs produced by GenotypeGVCFs without the -allSites flag, I don't get this error and it runs fine.


How can I use parallelism to make GATK tools run faster?


This document provides technical details and recommendations on how the parallelism options offered by the GATK can be used to yield optimal performance results.

Overview

As explained in the primer on parallelism for the GATK, there are two main kinds of parallelism that can be applied to the GATK: multi-threading and scatter-gather (using Queue or Cromwell/WDL).

Multi-threading options

There are two options for multi-threading with the GATK, controlled by the arguments -nt and -nct, respectively, which can be combined:

  • -nt / --num_threads
    controls the number of data threads sent to the processor

  • -nct / --num_cpu_threads_per_data_thread
    controls the number of CPU threads allocated to each data thread

For more information on how these multi-threading options work, please read the primer on parallelism for the GATK.

Memory considerations for multi-threading

Each data thread needs to be given the full amount of memory you’d normally give a single run. So if you’re running a tool that normally requires 2 Gb of memory and you use -nt 4, the multithreaded run will use 8 Gb of memory. In contrast, CPU threads share the memory allocated to their “mother” data thread, so you don’t need to worry about allocating memory based on the number of CPU threads you use.
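The memory arithmetic above can be sketched as follows (the 2 Gb per-run figure is just the illustrative number from the paragraph):

```shell
# Each data thread (-nt) needs a full copy of the single-run memory;
# CPU threads (-nct) share their data thread's allocation.
PER_RUN_GB=2   # memory one single-threaded run would need
NT=4           # number of data threads (-nt)
XMX_GB=$((PER_RUN_GB * NT))
echo "with -nt $NT, give the JVM -Xmx${XMX_GB}g"
```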

Additional consideration when using -nct with versions 2.2 and 2.3

Because of the way the -nct option was originally implemented, in versions 2.2 and 2.3, there is one CPU thread that is reserved by the system to “manage” the rest. So if you use -nct, you’ll only really start seeing a speedup with -nct 3 (which yields two effective "working" threads) and above. This limitation has been resolved in the implementation that will be available in versions 2.4 and up.

Scatter-gather

For more details on scatter-gather, see the primer on parallelism for the GATK and the documentation on pipelining options.

Applicability of parallelism to the major GATK tools

Please note that not all tools support all parallelization modes. The parallelization modes that are available for each tool depend partly on the type of traversal that the tool uses to walk through the data, and partly on the nature of the analyses it performs.

Tool  Full name               Type of traversal   NT   NCT  SG
RTC   RealignerTargetCreator  RodWalker           +    -    -
IR    IndelRealigner          ReadWalker          -    -    +
BR    BaseRecalibrator        LocusWalker         -    +    +
PR    PrintReads              ReadWalker          -    +    -
RR    ReduceReads             ReadWalker          -    -    +
HC    HaplotypeCaller         ActiveRegionWalker  -    (+)  +
UG    UnifiedGenotyper        LocusWalker         +    +    +

Note that while HaplotypeCaller supports -nct in principle, many have reported that it is not very stable (random crashes may occur -- but if there is no crash, results will be correct). We prefer not to use this option with HC; use it at your own risk.

Recommended configurations

The table below summarizes configurations that we typically use for our own projects (one per tool, except we give three alternate possibilities for the UnifiedGenotyper). The different values allocated for each tool reflect not only the technical capabilities of these tools (which options are supported), but also our empirical observations of what provides the best tradeoffs between performance gains and commitment of resources. Please note however that this is meant only as a guide, and that we cannot give you any guarantee that these configurations are the best for your own setup. You will probably have to experiment with the settings to find the configuration that is right for you.

Tool                RTC  IR  BR      PR   RR  HC      UG
Available modes     NT   SG  NCT,SG  NCT  SG  NCT,SG  NT,NCT,SG
Cluster nodes       1    4   4       1    4   4       4 / 4 / 4
CPU threads (-nct)  1    1   8       4-8  1   4       3 / 6 / 24
Data threads (-nt)  24   1   1       1    1   1       8 / 4 / 1
Memory (Gb)         48   4   4       4    4   16      32 / 16 / 4

Where NT is data multithreading, NCT is CPU multithreading and SG is scatter-gather using Queue or other data parallelization framework. For more details on scatter-gather, see the primer on parallelism for the GATK and the documentation on pipelining options.

My VQSR tranches plot shows cumulative variants in tranches 0-90, 90-99, 99-99.9


Dear GATK-Team,

My VQSR tranches plot (exome data) shows cumulative variants in tranches 0-90, 90-99, and 99-99.9. To my understanding it should be the other way round (as in your article).

My tranche file is:

# Variant quality score tranches file
# Version number 5
targetTruthSensitivity,numKnown,numNovel,knownTiTv,novelTiTv,minVQSLod,filterName,model,accessibleTruthSites,callsAtTruthSites,truthSensitivity
90.00,1047316,37823,2.1789,1.2078,3.4763,VQSRTrancheSNP0.00to90.00,SNP,664834,598350,0.9000
99.00,1256170,54507,2.1551,1.2197,0.1249,VQSRTrancheSNP90.00to99.00,SNP,664834,658185,0.9900
99.90,1303686,68389,2.1437,1.2255,-4.2698,VQSRTrancheSNP99.00to99.90,SNP,664834,664169,0.9990
100.00,1338832,87080,2.1302,1.2417,-1299.3394,VQSRTrancheSNP99.90to100.00,SNP,664834,664834,1.0000
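One reading note that may explain the confusion: the numKnown and numNovel columns in the tranche file are cumulative (each tranche counts all variants above its minVQSLod cutoff), so the per-tranche counts that a stacked plot normally shows are the successive differences. Using the numNovel values from the file above:

```shell
# numNovel is cumulative across tranches; per-tranche novel counts are
# the successive differences (values taken from the tranche file above).
NOVEL="37823 54507 68389 87080"
echo "$NOVEL" | awk '{
  prev = 0
  for (i = 1; i <= NF; i++) { print $i - prev; prev = $i }
}' > per_tranche_novel.txt
cat per_tranche_novel.txt
```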

Greetings

Martin

Usage of "--dontUseSoftClippedBases" HaplotypeCaller option for exome enrichment data


Hi GATK Team,
Since HaplotypeCaller does not call structural variants from soft-clipped bases, the "--dontUseSoftClippedBases" option should mainly reduce false positives (e.g. from incomplete adapter trimming)? Is this reasoning correct, or am I wrong?

Greetings from Munich

Forum Search Problem


Hi there,

I appear to be unable to navigate to pages beyond page 1 in search results on the forum site. The page navigation numbers near the bottom of the search results page appear to not be selectable.

Is anyone else running into this?

OS: Ubuntu 14.04
Browser: Google Chrome 60.0.3112.90

Truth & Control sources- HapMap and 1000G


Hi everyone,

I apologize in advance if this question seems like a stupid one, but I have always thought that resources such as HapMap and 1000G from the resource bundle that we use in VQSR comprise many global samples. But when I peeked inside the VCFs, I only saw a reference and alternate allele, seemingly for one sample only. What am I missing here?

If the multi-sample genotype info is somehow incorporated into the VCF index file, is there a way to display the contents of the index file? I would like to remove all African samples, since they are irrelevant to my test sample and seem to be negatively affecting the calibration and the calls for my test sample.
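For what it's worth: the HapMap and 1000G files in the resource bundle are, to my understanding, sites-only VCFs, meaning the per-sample FORMAT and genotype columns were stripped before distribution, so no per-sample genotypes are hidden in the index file (the index only speeds up lookups). A quick check is to count the columns on a data line; a sites-only VCF has exactly the eight fixed columns. Toy, whitespace-delimited example:

```shell
# A sites-only VCF keeps only the 8 fixed columns (#CHROM..INFO) and has
# no FORMAT or sample columns. The record below is invented.
cat > sites_only.vcf <<'EOF'
#CHROM POS ID REF ALT QUAL FILTER INFO
1 100 rs1 A G 50 PASS AF=0.12
EOF
NCOLS=$(grep -v '^#' sites_only.vcf | head -n 1 | awk '{ print NF }')
echo "data-line columns: $NCOLS"
```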

MuTect2: "Somehow the requested coordinate is not covered by the read. Too many deletions?"


Hello! I am using muTect2 (in particular I am following this pipeline: http://gatkforums.broadinstitute.org/gatk/discussion/5963/tumor-normal-paired-exome-sequencing-pipeline) but today I am getting this error on chromosome 3:

ERROR --
ERROR stack trace

org.broadinstitute.gatk.utils.exceptions.ReviewedGATKException: Somehow the requested coordinate is not covered by the read. Too many deletions?
        at org.broadinstitute.gatk.utils.sam.ReadUtils.getReadCoordinateForReferenceCoordinate(ReadUtils.java:490)
        at org.broadinstitute.gatk.utils.sam.ReadUtils.getReadCoordinateForReferenceCoordinate(ReadUtils.java:436)
        at org.broadinstitute.gatk.utils.sam.ReadUtils.getReadCoordinateForReferenceCoordinate(ReadUtils.java:427)
        at org.broadinstitute.gatk.utils.clipping.ReadClipper.hardClipByReferenceCoordinates(ReadClipper.java:543)
        at org.broadinstitute.gatk.utils.clipping.ReadClipper.hardClipByReferenceCoordinatesLeftTail(ReadClipper.java:177)
        at org.broadinstitute.gatk.utils.clipping.ReadClipper.hardClipAdaptorSequence(ReadClipper.java:408)
        at org.broadinstitute.gatk.utils.clipping.ReadClipper.hardClipAdaptorSequence(ReadClipper.java:411)
        at org.broadinstitute.gatk.tools.walkers.cancer.m2.MuTect2.finalizeActiveRegion(MuTect2.java:1201)
        at org.broadinstitute.gatk.tools.walkers.cancer.m2.MuTect2.assembleReads(MuTect2.java:1145)
        at org.broadinstitute.gatk.tools.walkers.cancer.m2.MuTect2.map(MuTect2.java:536)
        at org.broadinstitute.gatk.tools.walkers.cancer.m2.MuTect2.map(MuTect2.java:176)
        at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:709)
        at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:705)
        at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler$ReadMapReduceJob.run(NanoScheduler.java:471)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.6-0-g89b7209):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions https://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Somehow the requested coordinate is not covered by the read. Too many deletions?
ERROR ------------------------------------------------------------------------------------------

Thank you in advance!
Best

java.lang.RuntimeException: Error processing input from file.bam: Invalid reference index -1


I'm getting the error above when I run Genome Strip discovery pipeline. I would appreciate help troubleshooting. Below are the lines in the output beginning with the first error through the end of the output, which include the stack trace. Thank you in advance.

ERROR 20:21:24,583 FunctionEdge - Error:  'java'  '-Xmx4096m'  '-XX:+UseParallelOldGC'  '-XX:ParallelGCThreads=4'  '-XX:GCTimeLimit=50'  '-XX:GCHeapFreeLimit=10'  '-Djava.io.tmpdir=/scratch/ernestjb/genomeStrip.2017-07-10/discovery/tmp'  '-cp' '/data/ernestjb/svtoolkit/lib/SVToolkit.jar:/data/ernestjb/svtoolkit/lib/gatk/GenomeAnalysisTK.jar:/data/ernestjb/svtoolkit/lib/gatk/Queue.jar' org.broadinstitute.sv.main.SVDiscovery '-T' 'SVDiscoveryWalker'  '-R' '/scratch/ernestjb/Homo_sapiens_assembly19/Homo_sapiens_assembly19.fasta'  '-I' 'bamFiles.1.2017-05-15.list'  '-O' '/scratch/ernestjb/genomeStrip.2017-07-10/discovery/P0085.discovery.vcf.gz'  '-disableGATKTraversal' 'true'  '-md' '/scratch/ernestjb/genomeStrip.2017-07-10/preprocess'  '-configFile' '/data/ernestjb/svtoolkit/conf/genstrip_parameters.txt'  '-runDirectory' '/scratch/ernestjb/genomeStrip.2017-07-10/discovery'  '-genderMapFile' '/scratch/ernestjb/genomeStrip.2017-07-10/preprocess/sample_gender.report.txt'  '-genomeMaskFile' '/scratch/ernestjb/Homo_sapiens_assembly19/Homo_sapiens_assembly19.svmask.fasta'  '-partitionName' 'P0085'  '-runFilePrefix' 'P0085'  '-storeReadPairFile' 'true'  -L NC_007605:1-171823 -searchLocus NC_007605:1-171823 -searchWindow NC_007605:1-171823 -searchMinimumSize 100 -searchMaximumSize 100000
ERROR 20:21:24,590 FunctionEdge - Contents of /spin1/home/linux/ernestjb/code/genomeStrip/SVDiscovery-85.out:
INFO  20:21:14,689 HelpFormatter - -----------------------------------------------------------------------------------------
INFO  20:21:14,692 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.5.GS-r1732-0-gf101448, Compiled 2017/04/18 15:39:27
INFO  20:21:14,692 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO  20:21:14,692 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO  20:21:14,696 HelpFormatter - Program Args: -T SVDiscoveryWalker -R /scratch/ernestjb/Homo_sapiens_assembly19/Homo_sapiens_assembly19.fasta -O /scratch/ernestjb/genomeStrip.2017-07-10/discovery/P0085.discovery.vcf.gz -disableGATKTraversal true -md /scratch/ernestjb/genomeStrip.2017-07-10/preprocess -configFile /data/ernestjb/svtoolkit/conf/genstrip_parameters.txt -runDirectory /scratch/ernestjb/genomeStrip.2017-07-10/discovery -genderMapFile /scratch/ernestjb/genomeStrip.2017-07-10/preprocess/sample_gender.report.txt -genomeMaskFile /scratch/ernestjb/Homo_sapiens_assembly19/Homo_sapiens_assembly19.svmask.fasta -partitionName P0085 -runFilePrefix P0085 -storeReadPairFile true -L NC_007605:1-171823 -searchLocus NC_007605:1-171823 -searchWindow NC_007605:1-171823 -searchMinimumSize 100 -searchMaximumSize 100000
INFO  20:21:14,701 HelpFormatter - Executing as ernestjb@cn3114 on Linux 2.6.32-642.3.1.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14.
INFO  20:21:14,701 HelpFormatter - Date/Time: 2017/08/16 20:21:14
INFO  20:21:14,701 HelpFormatter - -----------------------------------------------------------------------------------------
INFO  20:21:14,701 HelpFormatter - -----------------------------------------------------------------------------------------
INFO  20:21:15,048 16-Aug-2017 GenomeAnalysisEngine - Strictness is SILENT
INFO  20:21:15,194 16-Aug-2017 GenomeAnalysisEngine - Downsampling Settings: No downsampling
INFO  20:21:15,233 16-Aug-2017 IntervalUtils - Processing 171823 bp from intervals
INFO  20:21:15,320 16-Aug-2017 GenomeAnalysisEngine - Preparing for traversal
INFO  20:21:15,322 16-Aug-2017 GenomeAnalysisEngine - Done preparing for traversal
INFO  20:21:15,322 16-Aug-2017 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO  20:21:15,322 16-Aug-2017 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining
INFO  20:21:15,322 16-Aug-2017 ProgressMeter -        Location |     reads | elapsed |     reads | completed | runtime |   runtime
INFO  20:21:15,322 16-Aug-2017 SVDiscovery - Initializing SVDiscovery ...
INFO  20:21:15,322 16-Aug-2017 SVDiscovery - Reading configuration file ...
INFO  20:21:15,328 16-Aug-2017 SVDiscovery - Read configuration file.
INFO  20:21:15,328 16-Aug-2017 SVDiscovery - Opening reference sequence ...
INFO  20:21:15,329 16-Aug-2017 SVDiscovery - Opened reference sequence.
INFO  20:21:15,329 16-Aug-2017 SVDiscovery - Opening genome mask ...
INFO  20:21:15,336 16-Aug-2017 SVDiscovery - Opened genome mask.
INFO  20:21:15,336 16-Aug-2017 SVDiscovery - Initializing input data set ...
INFO  20:21:15,433 16-Aug-2017 SVDiscovery - Initialized data set: 16 files, 93 read groups, 16 samples.
INFO  20:21:15,434 16-Aug-2017 MetaData - Opening metadata ...
INFO  20:21:15,436 16-Aug-2017 MetaData - Adding metadata directory /scratch/ernestjb/genomeStrip.2017-07-10/preprocess ...
INFO  20:21:15,453 16-Aug-2017 MetaData - Opened metadata.
INFO  20:21:15,457 16-Aug-2017 SVDiscovery - Opened metadata.
INFO  20:21:15,462 16-Aug-2017 MetaData - Loading insert size distributions ...
INFO  20:21:15,637 16-Aug-2017 SVDiscovery - Processing locus: NC_007605:1-171823:100-100000
INFO  20:21:15,638 16-Aug-2017 SVDiscovery - Locus search window: NC_007605:1-171823
Caught exception while processing read: null
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
java.lang.RuntimeException: Error processing input from /scratch/ernestjb/sfari/11002/BAM/Sample_SSC03070/analysis/SSC03070.final.bam: Invalid reference index -1
  at org.broadinstitute.sv.discovery.DeletionDiscoveryAlgorithm.runTraversal(DeletionDiscoveryAlgorithm.java:159)
  at org.broadinstitute.sv.discovery.SVDiscoveryWalker.onTraversalDone(SVDiscoveryWalker.java:105)
  at org.broadinstitute.sv.discovery.SVDiscoveryWalker.onTraversalDone(SVDiscoveryWalker.java:40)
  at org.broadinstitute.gatk.engine.executive.Accumulator$StandardAccumulator.finishTraversal(Accumulator.java:129)
  at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:116)
  at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:315)
  at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121)
  at org.broadinstitute.sv.main.SVCommandLine.execute(SVCommandLine.java:133)
  at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
  at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
  at org.broadinstitute.sv.main.SVCommandLine.main(SVCommandLine.java:87)
  at org.broadinstitute.sv.main.SVDiscovery.main(SVDiscovery.java:21)
Caused by: java.lang.IllegalArgumentException: Invalid reference index -1
  at htsjdk.samtools.QueryInterval.<init>(QueryInterval.java:24)
  at htsjdk.samtools.SamReader$PrimitiveSamReaderToSamReaderAdapter.query(SamReader.java:504)
  at htsjdk.samtools.SamReader$PrimitiveSamReaderToSamReaderAdapter.queryContained(SamReader.java:376)
  at org.broadinstitute.sv.discovery.DeletionDiscoveryAlgorithm.runTraversal(DeletionDiscoveryAlgorithm.java:145)
  ... 11 more
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 3.5.GS-r1732-0-gf101448):
##### ERROR
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: Error processing input from /scratch/ernestjb/sfari/11002/BAM/Sample_SSC03070/analysis/SSC03070.final.bam: Invalid reference index -1
##### ERROR ------------------------------------------------------------------------------------------
INFO  20:21:24,591 QGraph - Writing incremental jobs reports...
INFO  20:21:24,591 QJobsReporter - Writing JobLogging GATKReport to file /spin1/home/linux/ernestjb/code/genomeStrip/SVDiscovery.jobreport.txt
INFO  20:21:24,598 QGraph - 5 Pend, 0 Run, 1 Fail, 84 Done
INFO  20:21:24,599 QCommandLine - Writing final jobs report...
INFO  20:21:24,600 QJobsReporter - Writing JobLogging GATKReport to file /spin1/home/linux/ernestjb/code/genomeStrip/SVDiscovery.jobreport.txt
INFO  20:21:24,604 QJobsReporter - Plotting JobLogging GATKReport to file /spin1/home/linux/ernestjb/code/genomeStrip/SVDiscovery.jobreport.pdf
WARN  20:21:26,315 RScriptExecutor - RScript exited with 1. Run with -l DEBUG for more info.
INFO  20:21:26,318 QCommandLine - Done with errors
INFO  20:21:26,342 QGraph - -------
INFO  20:21:26,344 QGraph - Failed:   'java'  '-Xmx4096m'  '-XX:+UseParallelOldGC'  '-XX:ParallelGCThreads=4'  '-XX:GCTimeLimit=50'  '-XX:GCHeapFreeLimit=10'  '-Djava.io.tmpdir=/scratch/ernestjb/genomeStrip.2017-07-10/discovery/tmp'  '-cp' '/data/ernestjb/svtoolkit/lib/SVToolkit.jar:/data/ernestjb/svtoolkit/lib/gatk/GenomeAnalysisTK.jar:/data/ernestjb/svtoolkit/lib/gatk/Queue.jar' org.broadinstitute.sv.main.SVDiscovery '-T' 'SVDiscoveryWalker'  '-R' '/scratch/ernestjb/Homo_sapiens_assembly19/Homo_sapiens_assembly19.fasta'  '-I' 'bamFiles.1.2017-05-15.list'  '-O' '/scratch/ernestjb/genomeStrip.2017-07-10/discovery/P0085.discovery.vcf.gz'  '-disableGATKTraversal' 'true'  '-md' '/scratch/ernestjb/genomeStrip.2017-07-10/preprocess'  '-configFile' '/data/ernestjb/svtoolkit/conf/genstrip_parameters.txt'  '-runDirectory' '/scratch/ernestjb/genomeStrip.2017-07-10/discovery'  '-genderMapFile' '/scratch/ernestjb/genomeStrip.2017-07-10/preprocess/sample_gender.report.txt'  '-genomeMaskFile' '/scratch/ernestjb/Homo_sapiens_assembly19/Homo_sapiens_assembly19.svmask.fasta'  '-partitionName' 'P0085'  '-runFilePrefix' 'P0085'  '-storeReadPairFile' 'true'  -L NC_007605:1-171823 -searchLocus NC_007605:1-171823 -searchWindow NC_007605:1-171823 -searchMinimumSize 100 -searchMaximumSize 100000
INFO  20:21:26,344 QGraph - Log:     /spin1/home/linux/ernestjb/code/genomeStrip/SVDiscovery-85.out
INFO  20:21:26,347 QCommandLine - Script failed: 5 Pend, 0 Run, 1 Fail, 84 Done

热烈欢迎我们的中国朋友 / A warm welcome to our Chinese friends


科研圈的亲们,我们来啦!携手国内重量级公司和机构,我们这次给大家带来了高效、规模化使用GATK的技巧!

Today we are reaching out to the Chinese research community with great news: we are partnering with key companies and institutions in China to empower Chinese researchers to use GATK effectively and at scale.


大家可能已经有所耳闻,我们开发了一整套基因组数据分析系统, 涵盖了分析工具(GATK4,即将发行),流程控制语言(WDL),以及支持多种环境下--包括本地数据中心和云计算--执行分析流程的运行核心Cromwell 。这一整合的套装是为了让生物医药研究人员可以自如的运行、重复分析流程,包括我们一直以来推崇的GATK最佳运行方案(现在我们已经发布了可以即刻使用的流程)。我们希望这一系列努力可以大大减轻之前大家设置以及运行GATK的各种困扰以及烦人的臆测。

As you may know, we have developed a "full stack" genomics solution that combines analysis tools (GATK itself, with version 4 soon to be released), a workflow definition language called WDL, and an execution engine called Cromwell that can execute pipelines in multiple environments, on-premises and on the cloud. This integrated solution aims to empower biomedical researchers to run and replicate analysis pipelines, starting with the GATK Best Practices, for which we are now publishing ready-to-use WDL workflows. We hope this will dramatically cut down on the effort -- and sometimes guesswork! -- previously involved in standing up GATK pipelines.

当然我们并没有止步于单纯的提供工具和流程软件,我们也希望能在主流云计算平台下放飞这一系列匠心独具的设计。两年前,我们在努力开发这些软件工具的伊始,也开始了与六家行业引领者的合作:因特尔,谷歌,Cloudera,亚马逊,IBM以及微软;这六家公司和我们有着同样的目标:让广大用户可以在云平台下自如的使用我们的软件。

But our goals didn't stop at just building the pipelining software -- we wanted to make sure our tools would be easy to use on any of the major public clouds. So two years ago, as we were knuckling down to the hard work of developing these software tools, we forged a partnership with six industry leaders who agreed to help us bring our solution to the Cloud -- Intel, Google, Cloudera, Amazon Web Services (AWS), IBM and Microsoft.

现在我们重磅推出与阿里云,以及华大基因的合作!阿里云是中国主要的云计算运营商,而华大是主要的基因测序中心。两家机构都认同并且愿意帮助我们实现共同的目标:为全球每一位科研人员提供最好的、可重复的基因组数据分析流程软件。巧合的是现在我们的云端服务伙伴刚刚好是幸运数字八!同时我们也在积极的与其他研究所和商业机构商洽,包括中科院北京基因组研究所、诺禾致源、浪潮集团,他们也都表示了采纳我们分析套装的兴趣。

Now, we are thrilled that Alibaba Cloud, the major cloud service provider in China, and BGI, the major sequencing service provider, are both signing on to help in the pursuit of our common goal, which is to provide top-quality, reproducible genomics pipelines to everyone in the global research community. It is a happy coincidence that this brings our fellowship of the Cloud to a lucky number eight! We are also engaging with other key companies and institutions in China, including the Beijing Institute of Genomics, Novogene and Inspur, who have expressed interest in adopting our genomics stack.

But that's not all. We're aware that language is often an obstacle for our Chinese audience, so we are looking at options for establishing an outreach program specifically aimed at the Chinese community. This would include a Chinese-language forum, translations of the GATK and WDL documentation, as well as workshops in China. This will be a challenging new undertaking for us but I am optimistic that it will yield great benefits, as I am certain our communities have much to learn from each other.

Finally, I should mention I have personal reasons for being especially pleased that we are reaching out to the Chinese research community in this way. In 2008, I spent several months living and working on a research project at Huazhong Agricultural University in Wuhan, Hubei Province, and I will never forget the wonderful welcome I was given by the staff and students at HZAU. I look forward to finally reciprocating that welcome, at scale!

Photographic evidence… at the 2008 Olympic torch parade in Wuhan!


Many thanks to members of the Intel China team and to Steve Huang of the GATK development team for their invaluable help with the translation!

The point of co-realignment in indel realignment?

GATK RealignerTargetCreator takes a normal-tumor BAM list as input. Is there any difference in the output compared with passing the BAM files one by one? Why is co-realignment needed, i.e. why not just realign each BAM separately?

(How to) Run the GATK4 Docker locally and take a look inside


Document is in BETA. It may be incomplete and/or inaccurate. Post suggestions to the Comments section, and be sure to check there for updates as well.


1. Install Docker on your system

Install Docker for your system from https://docs.docker.com/engine/installation/, e.g. for Mac, Windows or Linux servers. There is also a program called Docker Toolbox; I have it installed, but I don't think it is necessary for running Docker containers locally or on a server.

On my Mac, I just double-click on the Docker whale icon to start the application. Check that Docker is running in the Mac menu bar at top by clicking on the icon that looks like a whale-container-ship.



2. Check your Docker software installation

See the Docker version with docker --version.

$ docker --version
Docker version 17.06.0-ce, build 02c1d87

If you have trouble, you may need to run one or more of the following commands.

docker-machine restart default
docker-machine regenerate-certs
docker-machine env

3. Download a Docker image from Dockerhub

In Docker, an image is the original from which we launch containers. We pull images from Dockerhub (https://hub.docker.com/), using Git-like commands. For example, the following command downloads a GATK4 docker image.

docker pull broadinstitute/gatk:4.beta.3

The part after the colon is the tag, i.e. the version of the image we pull. You can see which images you have locally with docker image ls. Here we see I have two different versions of broadinstitute/gatk, v4.beta.3 and v4.beta.2.

$ docker image ls
REPOSITORY                            TAG                    IMAGE ID            CREATED             SIZE
broadinstitute/gatk                   4.beta.3               5c138c493794        2 weeks ago         2.87GB
broadinstitute/gatk                   4.beta.2               507406cb4d85        3 weeks ago         2.88GB

4. Inspect a Docker image by running a container

There are two ways to inspect an image. One is with docker inspect 5c138c493794. The other is to launch a container off the image and root around within it much like you would a file system.

Launch a container using either its tag or its image ID; whichever one you use is shown as the container's image name.

docker run -i -t 5c138c493794

or

docker run -i -t broadinstitute/gatk:4.beta.3

Our bash session then opens into a location in the container preset by those who built the image.

root@f944f81ff6d7:/gatk#

We can check the contents of the current directory and the java version.

root@f944f81ff6d7:/gatk# ls -ltrh
total 148K
drwxr-xr-x  4 root root 4.0K Jul 26 15:49 docs
-rw-r--r--  1 root root  428 Jul 26 15:49 codecov.yml
-rwxr-xr-x  1 root root 4.5K Jul 26 15:49 build_docker.sh
-rw-r--r--  1 root root  21K Jul 26 15:49 build.gradle
-rw-r--r--  1 root root  33K Jul 26 15:49 README.md
-rw-r--r--  1 root root 1.5K Jul 26 15:49 LICENSE.TXT
-rw-r--r--  1 root root  690 Jul 26 15:49 Dockerfile
-rw-r--r--  1 root root  775 Jul 26 15:49 AUTHORS
drwxr-xr-x  1 root root 4.0K Jul 26 15:49 src
-rw-r--r--  1 root root   26 Jul 26 15:49 settings.gradle
drwxr-xr-x 10 root root 4.0K Jul 26 15:49 scripts
drwxr-xr-x  2 root root 4.0K Jul 26 15:49 resources_for_CI
-rwxr-xr-x  1 root root 5.2K Jul 26 15:49 gradlew
drwxr-xr-x  3 root root 4.0K Jul 26 15:49 gradle
-rwxr-xr-x  1 root root  19K Jul 26 15:49 gatk-launch
drwxr-xr-x  9 root root 4.0K Jul 26 15:53 build
-rw-r--r--  1 root root   40 Jul 26 15:55 run_unit_tests.sh
lrwxrwxrwx  1 root root   25 Jul 26 15:55 gatk.jar -> /gatk/build/libs/gatk.jar
-rw-r--r--  1 root root 1017 Jul 26 15:55 install_R_packages.R
root@96d91017226e:/gatk#
root@f944f81ff6d7:/gatk# java -version
openjdk version "1.8.0_131"
OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-0ubuntu1.16.04.2-b11)
OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)
root@f944f81ff6d7:/gatk#

Typing exit leaves the container and also stops it from running. Docker saves stopped container instances automatically; we can list them all with docker ps -a.

$ docker ps -a
CONTAINER ID        IMAGE                          COMMAND             CREATED              STATUS                     PORTS               NAMES
28035a3b71f1        broadinstitute/gatk:4.beta.3   "bash"              About a minute ago   Exited (0) 8 seconds ago                       silly_davinci
f944f81ff6d7        5c138c493794                   "bash"              6 minutes ago        Exited (0) 4 minutes ago                       fervent_wing
62fb9991a939        5c138c493794                   "bash"              6 minutes ago        Exited (0) 6 minutes ago                       tender_mirzakhani
96d91017226e        5c138c493794                   "bash"              3 days ago           Exited (0) 2 days ago                          vigilant_montalcini

As you can see, I have multiple containers launched from the same image. Notice, however, that each container has a unique ID (under CONTAINER ID) and name (under NAMES). Whatever changes I make within a container get saved to that container. We can remove containers with docker container rm using either the container ID or name.

$ docker container rm silly_davinci
silly_davinci
$ docker ps -a
CONTAINER ID        IMAGE                      COMMAND             CREATED             STATUS                      PORTS               NAMES
f944f81ff6d7        5c138c493794               "bash"              11 minutes ago      Exited (0) 9 minutes ago                        fervent_wing
62fb9991a939        5c138c493794               "bash"              11 minutes ago      Exited (0) 11 minutes ago                       tender_mirzakhani
96d91017226e        5c138c493794               "bash"              3 days ago          Exited (0) 2 days ago                           vigilant_montalcini
$ docker container rm f944f81ff6d7
f944f81ff6d7
$ docker ps -a
CONTAINER ID        IMAGE                      COMMAND             CREATED             STATUS                      PORTS               NAMES
62fb9991a939        5c138c493794               "bash"              12 minutes ago      Exited (0) 12 minutes ago                       tender_mirzakhani
96d91017226e        5c138c493794               "bash"              3 days ago          Exited (0) 2 days ago                           vigilant_montalcini

We can run one of these containers with docker start.

docker start 96d91017226e

It may take a minute for a container to start up. We can see the running containers with docker container ls.

$ docker container ls
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
96d91017226e        5c138c493794        "bash"              3 days ago          Up About a minute                       vigilant_montalcini

Finally, we can reattach to the running container.

docker attach vigilant_montalcini

On my local Mac there is a glitch, and I must press enter twice for the container's bash prompt to appear. You can also use the container ID instead of the name in the command. To exit a running container without stopping it, press Ctrl+P followed by Ctrl+Q.


5. Copy files from local system to the running container

There are two ways to do this: from within the container and from outside it. I only know how to copy files from outside the container. The container can be stopped or running.

docker cp file_you_want_to_copy <container_id>:<file_path_to_target_directory>

For example,

docker cp tumor.seg 96d91017226e:/gatk

copies the file tumor.seg into the /gatk directory of container 96d91017226e.


6. Save a modified container as an image and upload to Dockerhub

If you plan to modify a container and save it, remember that environment variables, e.g. those set in bashrc, do not work in Docker containers. Symlinks, however, work well; create them in, e.g., /usr/bin with ln -s path/to/item short_cut_name.

First, log into your Dockerhub account with docker login. If you don't have one, create one at https://hub.docker.com. My account is called spacecade7. For the container you have modified and wish to save a snapshot image of, use the following command.

docker commit 96d91017226e spacecade7/mygatk:versioning_tag1

The string that follows commit is the container ID. The last part is my Dockerhub account, followed by the name I want to give the image and a version tag. This saves the image locally.

To save the image to Dockerhub, use docker push spacecade7/mygatk:versioning_tag1. The image should appear in your Dockerhub account.


GATK 4.beta.2 vs. 4.beta.3 performance


Hi, I've been keeping up with the GATK 4.0 beta releases and I've noticed some performance differences between my runs of version 4.beta.2 and 4.beta.3. That is, 4.beta.3 tools seem to have a longer execution time. The runtimes and tools in question are:

Tool                              4.beta.2   4.beta.3
BaseRecalibrator                  1m 3s      3m 3s
ApplyBQSR (scattered)             4m 48s     11m 51s
HaplotypeCaller (scattered)       23m 42s    29m 7s
GenotypeGVCFs (scattered)         4m 6s      9m 28s
VariantRecalibrator (for SNPs)    4m 7s      6m 38s
VariantRecalibrator (for INDELs)  2m 7s      4m 8s
ApplyVQSR (for SNPs)              37s        2m 36s
ApplyVQSR (for INDELs)            39s        2m 35s

Note: The scattered tools are not the Spark versions; they are simply scattered over intervals from a custom BED file. The FASTQs used for these runs are synthetic exome files (just for quick and easy testing). The computation was done in the cloud, not on a local machine. Roughly the same runtimes were obtained on reruns.

I haven't found anything in the release notes that would suggest these results. Are these runtimes to be expected?

Thank you for your help!

What is this error and how do I resolve it? It is related to generating a PDF


INFO 13:15:42,730 AnalyzeCovariates - Generating plots file 'LB0003_recal_plots.pdf'

ERROR --
ERROR stack trace

org.broadinstitute.gatk.utils.R.RScriptExecutorException: RScript exited with 1. Run with -l DEBUG for more info.
at org.broadinstitute.gatk.utils.R.RScriptExecutor.exec(RScriptExecutor.java:176)
at org.broadinstitute.gatk.engine.recalibration.RecalUtils.generatePlots(RecalUtils.java:555)
at org.broadinstitute.gatk.tools.walkers.bqsr.AnalyzeCovariates.generatePlots(AnalyzeCovariates.java:373)
at org.broadinstitute.gatk.tools.walkers.bqsr.AnalyzeCovariates.initialize(AnalyzeCovariates.java:387)
at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:83)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:316)
at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:123)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:256)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:158)
at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:108)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.7-0-gcfedb67):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions https://software.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: RScript exited with 1. Run with -l DEBUG for more info.
ERROR ------------------------------------------------------------------------------------------

As far as I understand, the issue is that four R libraries need to be installed, namely ggplot2, gplots, reshape and gsalib.

Am I correct?

If yes, where should these libraries be installed?

Below is the command I am running:

java -jar ../GenomeAnalysisTK.jar -T AnalyzeCovariates -R ref_Manta.fa -before LB0003_recal_data.2.table -after LB0003_post_recal_data.2.table -plots LB0003_recal_plots.pdf -log LB0003_plots_efile
