Channel: Recent Discussions — GATK-Forum

MergeBamAlignment without ALT contigs


Hi,

I'd like to modify the five-dollar-genome pipeline to run with hg19 / b37 reference data.

I'm failing to understand why it is hard-coded to require the ALT contigs:

    # if ref_alt has data in it,
    if [ -s ${ref_alt} ]; then
      java -Xms5000m -jar /usr/gitc/picard.jar \
        SamToFastq \
        INPUT=${input_bam} \
        FASTQ=/dev/stdout \
        INTERLEAVE=true \
        NON_PF=true | \
      /usr/gitc/${bwa_commandline} /dev/stdin - 2> >(tee ${output_bam_basename}.bwa.stderr.log >&2) | \
      java -Dsamjdk.compression_level=${compression_level} -Xms3000m -jar /usr/gitc/picard.jar \
        MergeBamAlignment \
        VALIDATION_STRINGENCY=SILENT \
        EXPECTED_ORIENTATIONS=FR \
        ATTRIBUTES_TO_RETAIN=X0 \
        ATTRIBUTES_TO_REMOVE=NM \
        ATTRIBUTES_TO_REMOVE=MD \
        ALIGNED_BAM=/dev/stdin \
        UNMAPPED_BAM=${input_bam} \
        OUTPUT=${output_bam_basename}.bam \
        REFERENCE_SEQUENCE=${ref_fasta} \
        PAIRED_RUN=true \
        SORT_ORDER="unsorted" \
        IS_BISULFITE_SEQUENCE=false \
        ALIGNED_READS_ONLY=false \
        CLIP_ADAPTERS=false \
        MAX_RECORDS_IN_RAM=2000000 \
        ADD_MATE_CIGAR=true \
        MAX_INSERTIONS_OR_DELETIONS=-1 \
        PRIMARY_ALIGNMENT_STRATEGY=MostDistant \
        PROGRAM_RECORD_ID="bwamem" \
        PROGRAM_GROUP_VERSION="${bwa_version}" \
        PROGRAM_GROUP_COMMAND_LINE="${bwa_commandline}" \
        PROGRAM_GROUP_NAME="bwamem" \
        UNMAPPED_READ_STRATEGY=COPY_TO_TAG \
        ALIGNER_PROPER_PAIR_FLAGS=true \
        UNMAP_CONTAMINANT_READS=true \
        ADD_PG_TAG_TO_READS=false

      grep -m1 "read .* ALT contigs" ${output_bam_basename}.bwa.stderr.log | \
      grep -v "read 0 ALT contigs"

    # else ref_alt is empty or could not be found
    else
      exit 1;
    fi

(https://github.com/gatk-workflows/five-dollar-genome-analysis-pipeline/blob/3ad22df4bfaa605b6a5504f110264b2b08100128/tasks_pipelines/alignment.wdl#L71)

So it is set to fail if the ref_alt file is empty, even though the command itself never uses that file.

Could you explain the logic here, and perhaps show what the correct command would look like with an hg19 reference?
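If the intent is simply to run the same pipe without an ALT file, a relaxed branch might look like the sketch below. This is untested and only an assumption about what the authors would accept: bwa would no longer be alt-aware, which may be exactly what the hard-coded check is meant to prevent.

```shell
# Hypothetical relaxation (untested): warn instead of exiting when ref_alt is
# missing or empty. The SamToFastq | bwa | MergeBamAlignment pipe itself never
# reads ref_alt; only the ALT-contig sanity grep depends on it.
if [ ! -s ${ref_alt} ]; then
  echo "WARNING: ${ref_alt} missing or empty; aligning without ALT contigs" >&2
fi

java -Xms5000m -jar /usr/gitc/picard.jar \
  SamToFastq \
  INPUT=${input_bam} \
  FASTQ=/dev/stdout \
  INTERLEAVE=true \
  NON_PF=true | \
/usr/gitc/${bwa_commandline} /dev/stdin - 2> >(tee ${output_bam_basename}.bwa.stderr.log >&2) | \
java -Dsamjdk.compression_level=${compression_level} -Xms3000m -jar /usr/gitc/picard.jar \
  MergeBamAlignment \
  ALIGNED_BAM=/dev/stdin \
  UNMAPPED_BAM=${input_bam} \
  OUTPUT=${output_bam_basename}.bam \
  REFERENCE_SEQUENCE=${ref_fasta} \
  PAIRED_RUN=true
  # ...remaining MergeBamAlignment arguments exactly as in the original block
```

The "read ... ALT contigs" grep that follows would likewise need to be made conditional on ${ref_alt} existing.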

Thanks!


Strange behaviour (bias?) in BaseRecalibrator


Hello,
I would like to report possibly anomalous behaviour in GATK BaseRecalibrator.
In my analyses I follow the suggested "Best Practices": after aligning I mark duplicates (if needed) and always recalibrate.
Recently I wrote a Python script that, given a BAM file and the related BED, analyses the coverage in the "padding" regions upstream and downstream of the regions of interest (exons) (fig. 1).

The aim is to see how far beyond the exon, in both directions, the coverage stays above a certain threshold. Also a quality parameter "q" can be specified, so that any base with phred < q is not counted in the total coverage.

While doing this, I found out what looks like a strange bias. If I do not use the q parameter (so phred quality is not taken into account), the coverage level decreases gradually while we move away from the exon (fig.2), as expected.

If I use the q=30 parameter, though, I always observe a significant fall in coverage mainly in positions 2 and 3, both upstream and downstream; then the levels go back up and slowly decrease normally (fig.3).

This behaviour is never observed when the .bam file is NOT recalibrated. When I use the q=30 threshold on non-recalibrated bam, I do not detect any trouble (fig.4).

It looks like the recalibration process penalizes the base calls in those positions for some reason, and this can be verified by simply opening the recalibrated bam vs the non-recalibrated one in IGV.

When hovering on the bases at positions 2 and 3 (upstream or downstream with respect to the exon), a drop in quality can be noticed in the recal ones. The other positions are pretty much immune to this. One could argue that for some reason, the base calls in those particular positions have a quality score close to 30 before recalibration, and this process simply lowers them below our threshold. But it's not like that: before recalibration, all the positions flanking the exon have similar quality scores - pretty high in all the cases I analysed - so there's no implicit "disadvantage" in the starting quality for positions 2 and 3. It just seems that recalibration is particularly severe in those spots.

I tried to single out any possible confounding factor I could think of:
-- Tried several samples, coming from different runs and different points in time;
-- Tried runs from MiSeq, HiSeq, NextSeq;
-- Tried to use different versions of dbSNP as known sites, plus a high confidence set of SNPs from 1000Genomes;
-- Tried to use both GATK3 and GATK4

but no luck; the same behaviour persists. Do you have any clues?

Thanks

Mauro

Commands used for recalibration:
/usr/bin/java -jar /softwares/GATK_4.0/gatk-package-4.0.0.0-local.jar BaseRecalibrator -R hg19_ucsc_filtered.fa -I CA.bam -O CA.table --known-sites dbSNP_150_hg19_chr.vcf

/usr/bin/java -jar /softwares/GATK_4.0/gatk-package-4.0.0.0-local.jar ApplyBQSR -R hg19_ucsc_filtered.fa -I CA.bam -bqsr CA.table -O CA_recal.bam

Figures:
* fig.1: visual description of the "padding" regions under analysis, in orange.
* fig.2: the coverage level for each exon while moving "away" from it in both directions, for 15 bp. X axis: position relative to the exon (negative = upstream, positive = downstream); Y axis: coverage.
* fig.3: the same graph when introducing the q=30 filtering threshold. Many exons drop to zero coverage at positions 2 and 3.
* fig.4: the same graph when using the q=30 threshold on a NON-recalibrated BAM.

ApplyVQSR page example error


Dear team,
Thanks for the wonderful development of this tool.
When I tried out ApplyVQSR with GATK4, I was following the examples from https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_walkers_vqsr_ApplyVQSR.php,
and found that the option --ts_filter_level in the example is not recognised by GATK. Further reading on the page suggested the option should really be -ts-filter-level; --ts_filter_level is a carry-over from GATK 3.x. Can you please correct it?
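For reference, this is what I believe the corrected example invocation looks like with the GATK4 option name (the recal/tranches/VCF file names below are placeholders):

```shell
# ApplyVQSR with the GATK4-style hyphenated option instead of --ts_filter_level.
gatk ApplyVQSR \
  -R reference.fasta \
  -V input.vcf.gz \
  --recal-file output.recal \
  --tranches-file output.tranches \
  -ts-filter-level 99.0 \
  -mode SNP \
  -O output.vcf.gz
```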
Thanks,
Jing

select private SNPs for individual samples from a multisample VCF


I'm trying to select private SNPs for each sample from a multisample VCF. Here is my workaround command:

java -Xmx16G -jar GenomeAnalysisTK.jar \
    -T SelectVariants \
    -R ${FASTA} \
    -V ${VCF} \
    -o ${SAMPLE}.snv.vcf \
    -sn ${SAMPLE} \
    --restrictAllelesTo BIALLELIC \
    -select "AC==1" \
    --keepOriginalAC
version: GenomeAnalysisTK-3.7-93-ge9d8068

However, the filtered output does not contain only private SNPs; e.g. AC_Orig is > 1 for most positions.
I'm wondering if there is something else I need to add or if you have any other suggestion to complete this task.
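One possible workaround sketch: since -select is evaluated after subsetting to the sample, a second pass selecting on the preserved AC_Orig annotation via a JEXL expression might do it. This is an untested assumption about how AC_Orig can be queried:

```shell
# Second pass (untested sketch): keep only sites whose original cohort-wide
# allele count was 1, i.e. the alternate allele is private to ${SAMPLE}.
java -Xmx16G -jar GenomeAnalysisTK.jar \
    -T SelectVariants \
    -R ${FASTA} \
    -V ${SAMPLE}.snv.vcf \
    -o ${SAMPLE}.private.snv.vcf \
    -select 'vc.getAttributeAsInt("AC_Orig", 0) == 1'
```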

Many thanks,
Jose

Discrepancies of DepthOfCoverage report


DepthOfCoverage here is the module from GATK 3.8.0.
the command is:
java -Xmx32g -jar /share/data1/local/bin/GenomeAnalysisTK.jar -T DepthOfCoverage -R /share/data1/genome/hs38DH.fa -o coverage -I combined_realign.bam -L /share/data1/PublicProject/GATK_bundle/wgs_calling_regions.hg38.bed -ct 1 -ct 3 -ct 10 -ct 20 -omitBaseOutput

wgs_calling_regions.hg38.bed is a simplified version of wgs_calling_regions.hg38.interval_list, which is taken from the GATK bundle.

I also paste the sample_summary below; the discrepancy is between "granular_third_quartile granular_median granular_first_quartile" and "%_bases_above_1 %_bases_above_3 %_bases_above_10 %_bases_above_20".

Take the first sample for example: 61% of the region has coverage above 3, yet the granular_median is reported as 1. Can the team kindly explain the discrepancy, and which value should I believe?

sample_id total mean granular_third_quartile granular_median granular_first_quartile %_bases_above_1 %_bases_above_3 %_bases_above_10 %_bases_above_20
Y87645825 10726716375 3.67 1 1 1 90.8 61.0 2.8 0.1
Y87646062 10236869174 3.50 1 1 1 90.1 59.7 1.9 0.1
Y87645848 11931990342 4.08 1 1 1 91.0 64.7 4.8 0.2
E229909 8072356330 2.76 1 1 1 87.8 47.3 0.6 0.1
E230084 8801018351 3.01 1 1 1 88.4 51.5 1.0 0.1
Y87645831 10232223584 3.50 1 1 1 90.9 60.0 1.8 0.1
E229884 30810889945 10.54 1 1 1 99.4 96.8 52.5 3.8
Y87645851 12119947525 4.15 1 1 1 90.5 64.7 5.5 0.2
Y87646049 10461827187 3.58 1 1 1 89.9 60.2 2.3 0.1
Total 113393838813 38.78 N/A N/A N/A

Mutect2 t_lod_fstar filter


Hello,

I have breast cancer RNA-seq samples. When I create my VCF files with Mutect2, some of the variants have the PASS value in the FILTER column and some have "t_lod_fstar". Does this mean the variant failed the t_lod_fstar threshold? Does Mutect2 filter out variants whose t_lod_fstar value is below a specific cutoff? Should I filter out the variants flagged with "t_lod_fstar"?

Also, all values in the QUAL column are '.' (just a dot, not a number). Is this normal for Mutect2? I know QUAL is one of the parameters used to decide whether a variant PASSes. Isn't the QUAL value needed to decide whether a variant PASSes?

I would be pleased if you can answer.

How to remove the profile? (Forum)


Hi all,
I want to delete my profile, but I cannot find any option for that in the settings menu.

Error: Cannot merge sequence dictionaries because sequence 2 and 19 are in different orders in two input sequence dictionaries


Hello all,

I am attempting to generate a VCF from 22 BAM files using HaplotypeCaller. I have prepared my reference file as suggested. When I run the command, I constantly end up with this error:
MESSAGE: Cannot merge sequence dictionaries because sequence 2 and 19 are in different orders in two input sequence dictionaries.
I deleted the .dict and .fai files and regenerated them, but that didn't help. I removed sequence 2 and 19 from the input BAM files, but even then I end up with the same message about sequence 2 and 19.
Here is my command line. Can you please suggest where I might be going wrong?

java -jar /home/sb47/Programs/GenomeAnalysisTK.jar -T HaplotypeCaller -R /scratch/sb47cp/Ref_5_71/Rattus_norvegicus.Rnor_5.0.71.dna_sm.toplevel.fa -I /scratch/sb47cp/global_seq_Rnorvegicus/Chineese/dedupR10_recal.bam -I /scratch/sb47cp/global_seq_Rnorvegicus/Chineese/dedupR12_recal.bam -I /scratch/sb47cp/global_seq_Rnorvegicus/France/France8.sorted.bam -I /scratch/sb47cp/global_seq_Rnorvegicus/France/France9.sorted.bam -I /scratch/sb47cp/global_seq_Rnorvegicus/Iceland/Iceland1.sorted.bam -I /scratch/sb47cp/global_seq_Rnorvegicus/Iceland/Iceland2.sorted.bam -I /scratch/sb47cp/global_seq_Rnorvegicus/Iceland/Iceland3.sorted.bam -I /scratch/sb47cp/global_seq_Rnorvegicus/Iran/Iran10.sorted.bam -I /scratch/sb47cp/global_seq_Rnorvegicus/Iran/Iran11.sorted.bam -I /scratch/sb47cp/global_seq_Rnorvegicus/Iran/Iran12.sorted.bam -I /scratch/sb47cp/global_seq_Rnorvegicus/north_china/HLJ1.sorted.bam -I /scratch/sb47cp/global_seq_Rnorvegicus/north_china/HLJ2.sorted.bam -I /scratch/sb47cp/global_seq_Rnorvegicus/north_china/HLJ4.sorted.bam -I /scratch/sb47cp/global_seq_Rnorvegicus/Norway/Norway1.sorted.bam -I /scratch/sb47cp/global_seq_Rnorvegicus/Norway/Norway2.sorted.bam -I /scratch/sb47cp/global_seq_Rnorvegicus/Norway/Norway4.sorted.bam -I /scratch/sb47cp/global_seq_Rnorvegicus/south_china/GD2.sorted.bam -I /scratch/sb47cp/global_seq_Rnorvegicus/south_china/GD3.sorted.bam -I /scratch/sb47cp/global_seq_Rnorvegicus/Germany/RWplus_rg.bam -I /scratch/sb47cp/global_seq_Rnorvegicus/Germany/RW_rg.bam -I /scratch/sb47cp/global_seq_Rnorvegicus/Rrattus/dedupRratUK_recal.bam -I /scratch/sb47cp/global_seq_Rnorvegicus/Rrattus/dedupRratUSA_recal.bam -o output.vcf.gz
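If some BAM headers list the contigs in a different order than the reference's .dict, one common fix from that era is Picard ReorderSam, which rewrites a BAM's sequence dictionary to match the reference; a hedged sketch for one of the inputs (the picard.jar path is a placeholder):

```shell
# Rewrite the BAM so its sequence dictionary matches the reference contig
# order, and create an index. Repeat for each input BAM whose order differs.
java -jar picard.jar ReorderSam \
  INPUT=/scratch/sb47cp/global_seq_Rnorvegicus/France/France8.sorted.bam \
  OUTPUT=/scratch/sb47cp/global_seq_Rnorvegicus/France/France8.reordered.bam \
  REFERENCE=/scratch/sb47cp/Ref_5_71/Rattus_norvegicus.Rnor_5.0.71.dna_sm.toplevel.fa \
  CREATE_INDEX=true
```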


strange alignment in mutect2 output bam


Hi, I use GATK 4.0.0.0 and found a strange alignment in the Mutect2 output BAM. I think the red T on the right of the indel should be aligned to the left T in the indel.

GATK 4.0.5.0 Module help command does not work?


Hi

As usual I like checking the help text from the application itself, but it seems to be broken in GATK 4.0.5.0.

gatk HaplotypeCaller -h throws an exception.

Gokalps-Mac-mini:~ sky$ gatk HaplotypeCaller -h
Using GATK jar /Users/sky/scripts/gatk-package-4.0.5.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /Users/sky/scripts/gatk-package-4.0.5.0-local.jar HaplotypeCaller -h
java.lang.IllegalArgumentException: Allowed values request for unrecognized string argument: input
    at org.broadinstitute.hellbender.cmdline.GATKPlugin.GATKAnnotationPluginDescriptor.getAllowedValuesForDescriptorHelp(GATKAnnotationPluginDescriptor.java:246)
    at org.broadinstitute.barclay.argparser.CommandLineArgumentParser.usageForPluginDescriptorArgumentIfApplicable(CommandLineArgumentParser.java:870)
    at org.broadinstitute.barclay.argparser.CommandLineArgumentParser.makeArgumentDescription(CommandLineArgumentParser.java:847)
    at org.broadinstitute.barclay.argparser.CommandLineArgumentParser.printArgumentUsage(CommandLineArgumentParser.java:791)
    at org.broadinstitute.barclay.argparser.CommandLineArgumentParser.lambda$printArgumentUsageBlock$2(CommandLineArgumentParser.java:276)
    at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
    at java.util.stream.SortedOps$SizedRefSortingSink.end(SortedOps.java:352)
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
    at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
    at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
    at org.broadinstitute.barclay.argparser.CommandLineArgumentParser.printArgumentUsageBlock(CommandLineArgumentParser.java:276)
    at org.broadinstitute.barclay.argparser.CommandLineArgumentParser.usage(CommandLineArgumentParser.java:308)
    at org.broadinstitute.barclay.argparser.CommandLineArgumentParser.parseArguments(CommandLineArgumentParser.java:417)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.parseArgs(CommandLineProgram.java:221)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:195)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
    at org.broadinstitute.hellbender.Main.main(Main.java:289)

And this is what happens in 4.0.4.0

Using GATK jar /gatk/build/libs/gatk-package-4.0.4.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /gatk/build/libs/gatk-package-4.0.4.0-local.jar HaplotypeCaller -h
USAGE: HaplotypeCaller [arguments]

Call germline SNPs and indels via local re-assembly of haplotypes
Version:4.0.4.0


Required Arguments:

--input,-I:String             BAM/SAM/CRAM file containing reads  This argument must be specified at least once.
                              Required.

--output,-O:String            File to which variants should be written  Required.

--reference,-R:String         Reference sequence file  Required.


Optional Arguments:

--activity-profile-out:String Output the raw activity profile results in IGV format  Default value: null.

--add-output-sam-program-record,-add-output-sam-program-record:Boolean
                              If true, adds a PG tag to created SAM/BAM/CRAM files.  Default value: true. Possible
                              values: {true, false}

--add-output-vcf-command-line,-add-output-vcf-command-line:Boolean
                              If true, adds a command line header line to created VCF files.  Default value: true.
                              Possible values: {true, false}

--alleles:FeatureInput        The set of alleles at which to genotype when --genotyping_mode is GENOTYPE_GIVEN_ALLELES
                              Default value: null.

--annotate-with-num-discovered-alleles:Boolean
                              If provided, we will annotate records with the number of alternate alleles that were
                              discovered (but not necessarily genotyped) at a given site  Default value: false. Possible
                              values: {true, false}

The same problem persists in the docker version as well.

Thanks.

Genotype all sites in MuTect2


Dear colleagues,
I've been trying different options to obtain genotypes for all sites, either variant or non-variant, with no success: I only get the variant sites. I went through the options several times and do not see what I am doing wrong. I am running gatk-4.0.4.0. This is my latest attempt, combining --output-mode EMIT_ALL_SITES and --all-site-pls true:

gatk Mutect2 \
-R hs37d5.fa \
-I tumour.bam \
-tumor tumsample \
-I bulk.bam \
-normal bulksample \
--germline-resource gnomad.exomes.r2.0.2.sites.vcf.bgz \
-O calls.vcf.gz \
-L 1:1-100000 \
--all-site-pls true \
--output-mode EMIT_ALL_SITES \
--af-of-alleles-not-in-resource 0.00003125

Thanks for any hint
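In case it is useful as a point of comparison: per-site emission is something HaplotypeCaller supports via GVCF mode. It is a germline caller, not a substitute for Mutect2's tumour-normal model, so this is only a sketch of that alternative:

```shell
# Different tool, shown only for comparison: HaplotypeCaller emits a record
# for every position (variant or reference block) in BP_RESOLUTION mode.
gatk HaplotypeCaller \
  -R hs37d5.fa \
  -I bulk.bam \
  -L 1:1-100000 \
  -ERC BP_RESOLUTION \
  -O bulk.allsites.g.vcf.gz
```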

Access to the GATK bundle


Hi there!

I'm trying to get access to the GATK resource bundle, but neither the direct link to that (ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/) nor the ftp server seems to work at the moment.
It would be really great if you could look into it.
Thanks for the help and for the patience.

Best,
Matteo

About java threads in GATK


Since GATK is Java-based, and Java is known for spawning multiple threads in many GATK tools (HaplotypeCaller, CombineGVCFs, GenotypeGVCFs, GenomicsDBImport, and so on), there should be ways to control such thread spawning when you have limited computing resources. In some GATK forum threads we have been advised to use -XX:ConcGCThreads and also -XX:ParallelGCThreads. Since their usage requires an in-depth understanding of Java garbage-collection threading, it is difficult for me to know when to use which. Can someone from the GATK developers team explain which -XX option should be used?
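For what it's worth, with the GATK4 wrapper these JVM flags can be passed per-invocation via --java-options; the specific values and file names below are illustrative, not recommendations:

```shell
# Cap both the stop-the-world (Parallel) and concurrent GC thread pools.
# -Xmx bounds the heap; the GC thread counts bound background CPU use.
gatk --java-options "-Xmx8g -XX:ParallelGCThreads=2 -XX:ConcGCThreads=1" \
  HaplotypeCaller \
  -R reference.fasta \
  -I sample.bam \
  -O sample.g.vcf.gz \
  -ERC GVCF
```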

(Howto) Run GATK4 in a Docker container


1. Install Docker

Follow the relevant link below depending on your computer system; on Mac and Windows, select the "Stable channel" download. Run through the installation instructions and initial setup page; they are very straightforward and should only take you a few minutes (not counting download time).
We have included instructions below for all steps after that first page, so you shouldn't need to go to any other pages in the Docker documentation. Frankly their docs are targeted at people who want to do things like run web applications on the cloud and can be quite frustrating to deal with.

Click here for Mac

Click here for Windows

Full list of supported systems and their install pages


2. Get the GATK4 container image

Go to your Terminal (it doesn't matter where your working directory is) and run the following command.

docker pull broadinstitute/gatk:4.beta.6

Note that the last bit after gatk: is the version tag, which you can change to get a different version than what we've specified here.

The GATK4 image is quite large so the download may take a little while if you've never done this before. The good news is that next time you need to pull a GATK4 image (e.g. to get another release), Docker will only pull the components that have been updated, so it will go faster.


3. Start up the GATK4 container

There are several different ways to do this in Docker. Here we're going to use the simplest invocation that gets us the functionality we need, i.e. the ability to log into the container once it's running and execute commands from inside it.

docker run -it broadinstitute/gatk:4.beta.6

If all goes well, this will start up the container in interactive mode, and you will automatically get logged into it. Your terminal prompt will change to something like this:

root@ea3a5218f494:/gatk#

At this point you can use classic shell commands to explore the container and see what's in there, if you like.


4. Run a GATK4 command in the container

The container has the gatk-launch script all set up and ready to go, so you can now run any GATK or Picard command you want. Note that if you want to run a Picard command, you need to use the new syntax, which follows GATK conventions (-I instead of I= and so on). Let's use --list to list all tools available in this version.

./gatk-launch --list

The output will start with a usage message (shown below) then a full list of tools and their summary descriptions.

Using GATK wrapper script /gatk/build/install/gatk/bin/gatk
Running:
    /gatk/build/install/gatk/bin/gatk --help
USAGE:  <program name> [-h]

Once you've verified that this works for you, you know you can run any GATK4 commands you want. But before you proceed, there's one more setup thing to go through, which is technically optional but will make your life much easier.


5. Use a mounted volume to access data that lives outside the container

This is the final piece of the puzzle. By default, when you're inside the container you can't access any data that lives on the filesystem outside of the container. One way to deal with that is to copy things back and forth, but that's wasteful and tedious. So we're going to follow the better path, which is to mount a volume in the container, i.e. establish a link that makes part of the filesystem visible from inside the container.

The hitch is that you can't do this after you've started running the container, so you'll have to shut it down and run a new one (not just restart the first one) with an extra part added to the command. In case you're wondering why we didn't do this from the get-go: the first command we ran is simpler, so there's less chance that something will go wrong, which is nice when you're trying something for the first time.

To shut down your container from inside it, you can just type exit while still inside the container:

exit

That should stop the container and take you back to your regular prompt. It's also possible to exit the container without stopping it (a move called detaching), but that's a matter for another time, since here we do want to stop it. You'll probably also want to learn how to clean up and delete old instances of containers that you no longer need.

For now, let's focus on starting a new instance of the GATK4 container, specifying in the following command what is your particular container ID and the filesystem location you want to mount.

docker run -v ~/my_project:/gatk/my_data -it broadinstitute/gatk:4.beta.6

Here I set the external location to an existing directory called my_project in my home directory (the key requirement is that it must be an absolute path), and I'm setting the mount point inside the container's /gatk directory. The name of the mount point can be the same as the mounted directory or something completely different; the main constraint is that it should not conflict with an existing directory, otherwise that directory would become inaccessible.

Assuming your paths are valid, this command starts up the container and logs you into it the same way as before; but now you can see by using ls that you have access to your filesystem. So now you can run GATK commands on any data you have lying around. Have fun!
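Putting steps 4 and 5 together, you can also run a single tool non-interactively against mounted data; the BAM file names here are placeholders:

```shell
# Mount ~/my_project at /gatk/my_data and run one GATK4 command directly,
# instead of logging into the container first. The container exits when
# the command finishes.
docker run -v ~/my_project:/gatk/my_data broadinstitute/gatk:4.beta.6 \
  ./gatk-launch PrintReads \
    -I /gatk/my_data/sample.bam \
    -O /gatk/my_data/sample.copy.bam
```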

GermlineCNVCaller --interval-merging-rule error.


Hi
I was testing the brand-new GermlineCNVCaller in 4.0.3.0; however, I hit a very strange error.

All my read collections were made with the following command:

gatk CollectReadCounts -R $HG19FULL --interval-merging-rule OVERLAPPING_ONLY -L $TSOREG -I samplename_final.bam -O samplename_counts.hdf5

And there was no problem with the DetermineGermlineContigPloidy step. All files were generated using the GATK 4.0.3.0 docker image, fresh from the docker repo.

The DetermineGermlineContigPloidy command was set up according to the doc files within the gatk folder. I have 32 samples and I was working in COHORT mode.

Here is the error message

Where is the problem here?


Missing variants from vcf to gvcf


Hello,
I work with complete NGS sequences of the Y chromosome. I'm creating a multisample GVCF from 24 single VCFs. Once I created the multisample GVCF, I realized that for 11 samples I'm missing variants. As seen in the example, columns 10 to 20 shouldn't be missing ('.:0,0'), since the variant is present in the single VCFs of those samples. What might be going wrong?

Y 28670117 . T C 9746.79 . AC=12;AF=1.00;AN=12;DP=239;FS=0.000;MLEAC=12;MLEAF=1.00;MQ=59.41;QD=31.70;SOR=0.894 GT:AD:DP:GQ:PL .:0,0 .:0,0 .:0,0 .:0,0 .:0,0 .:0,0 .:0,0 .:0,0 .:0,0 .:0,0 .:0,0 1:0,10:10:99:322,0 1:0,20:20:99:853,0 1:0,22:22:99:916,0 1:0,25:25:99:1023,0 1:0,25:25:99:1041,0 1:0,18:18:99:749,0 1:0,36:36:99:1418,0 1:0,12:12:99:523,0 1:0,30:30:99:310,0 1:0,9:9:99:294,0 1:0,3:3:99:105,0 1:0,28:28:99:1217,0.:0,0

This is the command that I used:

java -jar /home/GATK/GenomeAnalysisTK.jar -R /home/hgref_human_b37_ChrY/human_g1k_v37_decoy.fasta -T GenotypeGVCFs -o S.genotypeGVCF.vcf -allSites --variant sample1.haplotypecallerGVCF.g.vcf --variant sample2.haplotypecallerGVCF.g.vcf --variant allsamples.haplotypecallerGVCF.g.vcf > S.genotypeGVCF.log 2>&1
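One thing worth ruling out is the mixed input list (single-sample GVCFs plus an already-combined one). A sketch that combines all per-sample GVCFs explicitly before genotyping, using the same GATK 3.x toolchain (file names abbreviated):

```shell
# Merge the 24 per-sample GVCFs into one combined GVCF first...
java -jar /home/GATK/GenomeAnalysisTK.jar -T CombineGVCFs \
  -R /home/hgref_human_b37_ChrY/human_g1k_v37_decoy.fasta \
  --variant sample1.haplotypecallerGVCF.g.vcf \
  --variant sample2.haplotypecallerGVCF.g.vcf \
  -o combined24.g.vcf

# ...then joint-genotype only the combined file.
java -jar /home/GATK/GenomeAnalysisTK.jar -T GenotypeGVCFs \
  -R /home/hgref_human_b37_ChrY/human_g1k_v37_decoy.fasta \
  -allSites \
  --variant combined24.g.vcf \
  -o S.genotypeGVCF.vcf
```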

Thanks!

Difference in GenotypeGVCFs generated VCF after consolidation with GenomicsDBimport and CombineGVCF


Hi,

I had a set of 81 GVCFs in total that I consolidated first using GenomicsDBImport and then using CombineGVCFs, and then GenotypeGVCFs was run in both cases. For GenomicsDBImport, I ran the command per contig, ran GenotypeGVCFs on each database to get a per-contig VCF, and then used Picard GatherVcfs to make the final VCF. The commands I used are written below:

Using GenomicsDBimport:
java -Xmx90g -jar gatk-package-4.0.4.0-local.jar GenomicsDBImport -R water_buffalo_re_arranged_chrom_ref_genome.fa --TMP_DIR ./tmp --sample-name-map sample_names_map_new.txt --reader-threads 2 --genomicsdb-workspace-path "$contig" -L "$contig"

java -Xmx8G -XX:ConcGCThreads=1 -jar gatk-package-4.0.4.0-local.jar GenotypeGVCFs -R /water_buffalo_re_arranged_chrom_ref_genome.fa -new-qual -V gendb://"$contig" -O "$contig"_variants.vcf.gz

java -jar picard.jar GatherVcfs INPUT=list.txt OUTPUT=Final_med_buffalo_variants_81_samples.vcf.gz

Using CombineGVCF:
java -Xmx200g -XX:ConcGCThreads=1 -jar gatk-package-4.0.4.0-local.jar CombineGVCFs -R water_buffalo_re_arranged_chrom_ref_genome.fa --variant All_gvcf_gz.list -O combined_81.g.vcf.gz

java -Xmx8G -XX:ConcGCThreads=1 -jar gatk-package-4.0.4.0-local.jar GenotypeGVCFs -R water_buffalo_re_arranged_chrom_ref_genome.fa -new-qual -V combined_81.g.vcf.gz -O Final_variants_81_samples_using_CombineGVCF.vcf.gz

The final VCF in both cases should be the same. Unfortunately, it was not. On running bcftools isec, I found that some variants were present only in one VCF and some only in the other. What could be the reason behind this discrepancy?
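When diffing the two call sets, it may also be worth normalizing both first, so that representation differences (multiallelic splitting, indel left-alignment) are not counted as disagreements; a sketch with bcftools (output names are placeholders):

```shell
# Normalize both call sets against the same reference, then intersect.
for vcf in Final_med_buffalo_variants_81_samples.vcf.gz \
           Final_variants_81_samples_using_CombineGVCF.vcf.gz; do
  # -m -any splits multiallelic records; -f left-aligns indels vs the reference.
  bcftools norm -f water_buffalo_re_arranged_chrom_ref_genome.fa \
    -m -any "$vcf" -Oz -o "norm_$vcf"
  bcftools index -t "norm_$vcf"
done
bcftools isec -p isec_out \
  norm_Final_med_buffalo_variants_81_samples.vcf.gz \
  norm_Final_variants_81_samples_using_CombineGVCF.vcf.gz
```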

Kindly let me know if you need more information.

Seems like CombineGVCFs is freezing


Hello,

I am using the latest version of GATK (4.0.3.0). It seems that CombineGVCFs is required before GenotypeGVCFs for multiple samples according to the online Tool Documentation, so I am running CombineGVCFs after SNP calling. My problem is that CombineGVCFs runs very slowly and appears to freeze after reading the input files (I only have 6 samples), as below:

11:33:49.467 INFO CombineGVCFs - Initializing engine
11:33:56.507 INFO FeatureManager - Using codec VCFCodec to read file file:///homes/yuanwen/SR39-2/RWG1_assembly/151_RWG1_assembly.g.vcf
11:34:02.884 INFO FeatureManager - Using codec VCFCodec to read file file:///homes/yuanwen/SR39-2/RWG1_assembly/179_RWG1_assembly.g.vcf
11:34:08.362 INFO FeatureManager - Using codec VCFCodec to read file file:///homes/yuanwen/SR39-2/RWG1_assembly/338_RWG1_assembly.g.vcf
11:34:13.497 INFO FeatureManager - Using codec VCFCodec to read file file:///homes/yuanwen/SR39-2/RWG1_assembly/374_RWG1_assembly.g.vcf
11:34:22.429 INFO FeatureManager - Using codec VCFCodec to read file file:///homes/yuanwen/SR39-2/RWG1_assembly/449_RWG1_assembly.g.vcf
11:34:27.592 INFO FeatureManager - Using codec VCFCodec to read file file:///homes/yuanwen/SR39-2/RWG1_assembly/RWG1_control_RWG1_assembly.g.vcf

Could anyone help with or explain this issue? Thank you very much!
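One thing that sometimes helps rule out plain-text I/O as the bottleneck is feeding CombineGVCFs block-compressed, indexed GVCFs; a hedged sketch using the file names from the log above:

```shell
# Compress and index each GVCF so readers can seek instead of scanning text.
for s in 151 179 338 374 449 RWG1_control; do
  bgzip "${s}_RWG1_assembly.g.vcf"
  tabix -p vcf "${s}_RWG1_assembly.g.vcf.gz"
done
```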

Best,
Yuanwen

CNVDiscoveryPipeline fails at Stage5 -- no warn messages


Hi I'm using Genome STRiP CNVDiscoveryPipeline (v2.00.1833) on WGS data from a collection of inbred maize lines. I have populated the metadata directory to the best of my ability and was able to get SVPreprocess to complete successfully in 3 batches. All files in the metadata directory seem sensible with the exception of sample_gender_report.txt, which is blank except for the header.

I am running into difficulty within the CNVDiscoveryPipeline at Stage5. The error message is:

INFO  15:49:25,058 QJobsReporter - Writing JobLogging GATKReport to file /panfs/roc/groups/14/hirschc1/pmonnaha/CNVDiscoveryPipeline.jobreport.txt 
INFO  15:49:25,083 QJobsReporter - Plotting JobLogging GATKReport to file /panfs/roc/groups/14/hirschc1/pmonnaha/CNVDiscoveryPipeline.jobreport.pdf 
WARN  15:49:26,351 RScriptExecutor - RScript exited with 1. Run with -l DEBUG for more info. 
INFO  15:49:26,352 QCommandLine - Done with errors 
INFO  15:49:26,353 QGraph - ------- 
INFO  15:49:26,354 QGraph - Failed:   'java'  '-Xmx14336m'  '-XX:+UseParallelOldGC'  '-XX:ParallelGCThreads=4'  '-XX:GCTimeLimit=50'  '-XX:GCHeapFreeLimit=10'  '-Djava.io.tmpdir=/panfs/roc/groups/14/hirschc1/pmonnaha/.queue/tmp'  '-cp' '/home/hirschc1/pmonnaha/software/svtoolkit/lib/SVToolkit.jar:/home/hirschc1/pmonnaha/software/svtoolkit/lib/gatk/GenomeAnalysisTK.jar:/home/hirschc1/pmonnaha/software/svtoolkit/lib/gatk/Queue.jar'  '-cp' '/home/hirschc1/pmonnaha/software/svtoolkit/lib/SVToolkit.jar:/home/hirschc1/pmonnaha/software/svtoolkit/lib/gatk/GenomeAnalysisTK.jar:/home/hirschc1/pmonnaha/software/svtoolkit/lib/gatk/Queue.jar'  'org.broadinstitute.gatk.queue.QCommandLine'  '-cp' '/home/hirschc1/pmonnaha/software/svtoolkit/lib/SVToolkit.jar:/home/hirschc1/pmonnaha/software/svtoolkit/lib/gatk/GenomeAnalysisTK.jar:/home/hirschc1/pmonnaha/software/svtoolkit/lib/gatk/Queue.jar'  '-S' '/home/hirschc1/pmonnaha/software/svtoolkit/qscript/discovery/cnv/CNVDiscoveryStage5.q'  '-S' '/home/hirschc1/pmonnaha/software/svtoolkit/qscript/discovery/cnv/CNVDiscoveryStageBase.q' '-S' '/home/hirschc1/pmonnaha/software/svtoolkit/qscript/discovery/cnv/CNVDiscoveryGenotyper.q'  '-S' '/home/hirschc1/pmonnaha/software/svtoolkit/qscript/SVQScript.q'  '-gatk' '/home/hirschc1/pmonnaha/software/svtoolkit/lib/gatk/GenomeAnalysisTK.jar'  '-jobLogDir' '/panfs/roc/scratch/pmonnaha/Maize/gstrip/w22/cnv_stage5/logs'  '-memLimit' '14.0'  '-jobRunner' 'Drmaa'  '-gatkJobRunner' 'Drmaa'  '-jobNative' '-l walltime=24:00:00'  -run  '-runDirectory' '/panfs/roc/scratch/pmonnaha/Maize/gstrip/w22/cnv_stage5'  '-sentinelFile' '/panfs/roc/scratch/pmonnaha/Maize/gstrip/w22/cnv_sentinel_files/stage_5.sent'  --disableJobReport  '-configFile' '/home/hirschc1/pmonnaha/software/svtoolkit/conf/genstrip_parameters.txt'  '-P' 'depth.parityCorrectionThreshold:null'  '-R' '/home/hirschc1/pmonnaha/misc-files/gstrip/W22_chr1-10.fasta'  '-ploidyMapFile' 
'/home/hirschc1/pmonnaha/misc-files/gstrip/W22_chr1-10.ploidymap.txt'  '-genderMapFile' '/home/hirschc1/pmonnaha/misc-files/gstrip/W22_MetaData_E2-0/sample_gender.report.txt' '-genderMapFile' '/home/hirschc1/pmonnaha/misc-files/gstrip/W22_MetaData_E2-1/sample_gender.report.txt' '-genderMapFile' '/home/hirschc1/pmonnaha/misc-files/gstrip/W22_MetaData_E2-2/sample_gender.report.txt'  '-md' '/home/hirschc1/pmonnaha/misc-files/gstrip/W22_MetaData_E2-0' '-md' '/home/hirschc1/pmonnaha/misc-files/gstrip/W22_MetaData_E2-1' '-md' '/home/hirschc1/pmonnaha/misc-files/gstrip/W22_MetaData_E2-2'  -disableGATKTraversal  '-I' '/panfs/roc/scratch/pmonnaha/Maize/gstrip/w22/bam_headers/merged_headers.bam'  '-vpsReportsDirectory' '/panfs/roc/scratch/pmonnaha/Maize/gstrip/w22/cnv_stage4'  '-selectedSamplesList' '/panfs/roc/scratch/pmonnaha/Maize/gstrip/w22/cnv_stage5/eval/DiscoverySamples.list'  
INFO  15:49:26,354 QGraph - Log:     /panfs/roc/scratch/pmonnaha/Maize/gstrip/w22/logs/CNVDiscoveryPipeline-44.out 
INFO  15:49:26,354 QCommandLine - Script failed: 61 Pend, 0 Run, 1 Fail, 43 Done 
------------------------------------------------------------------------------------------
Done.
------------------------------------------------------------------------------------------

However, the log file CNVDiscoveryPipeline-44.out says 'There were no warn messages'. Furthermore, it seems the actual error happened earlier in the pipeline. The Stage3 merged.sites.vcf files contain only headers and no variant records. Within the Stage2 results, several files look wrong: the ClusterSeparation.report.dat file has NA in every column except ID, and the GenotypeLikelihoodStats, VariantsPerSample, and SelectedVariants files are all empty. Oddly, the Stage2 log files also all say 'There were no warn messages'. An example of the output from Stage1 looks like:

#CHROM POS ID REF ALT QUAL FILTER INFO

chr2 1 CNV_chr2_1_1000 A . . END=1000;SVTYPE=CNV
chr2 500 CNV_chr2_500_1500 T . . END=1500;SVTYPE=CNV
chr2 1000 CNV_chr2_1000_2000 A . . END=2000;SVTYPE=CNV
chr2 1500 CNV_chr2_1500_2500 G . . END=2500;SVTYPE=CNV
chr2 2000 CNV_chr2_2000_3000 T . . END=3000;SVTYPE=CNV
chr2 2500 CNV_chr2_2500_3500 T . . END=3500;SVTYPE=CNV

The log files for Stage1 also do not point to any errors. Does anyone have an idea as to what is going wrong? Or where should I be looking to track down the error?
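One quick sanity check is to count the non-header records in the suspect VCFs, so you can tell at a glance which stage first produced empty output. This is a generic sketch, not part of Genome STRiP itself; the helper name is illustrative:

```shell
# Count variant records (non-header lines) in a VCF.
# Prints 0 for a file that contains only header lines.
count_vcf_records() {
    # grep exits non-zero when there are no matches, so guard with || true
    grep -vc '^#' "$1" || true
}
```

Running this over the Stage1, Stage2, and Stage3 sites VCFs should show exactly where the records disappear.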

My job script is:

module load java/jdk1.8.0_144
module load samtools
module load htslib/1.6
module load R/3.3.3
module load libdrmaa/1.0.13

SV_DIR="/home/hirschc1/pmonnaha/software/svtoolkit"
export LD_LIBRARY_PATH=${SV_DIR}:${LD_LIBRARY_PATH}
export SV_DIR
export PATH=${SV_DIR}:${PATH}
export LD_LIBRARY_PATH=/panfs/roc/msisoft/libdrmaa/1.0.13/lib/:${LD_LIBRARY_PATH}

classpath="${SV_DIR}/lib/SVToolkit.jar:${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar:${SV_DIR}/lib/gatk/Queue.jar"
java -Xmx14g -cp ${classpath} \
     org.broadinstitute.gatk.queue.QCommandLine \
     -S ${SV_DIR}/qscript/discovery/cnv/CNVDiscoveryPipeline.q \
     -S ${SV_DIR}/qscript/SVQScript.q \
     -cp ${classpath} \
     -gatk ${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar \
     -configFile ${SV_DIR}/conf/genstrip_parameters.txt \
     -R /home/hirschc1/pmonnaha/misc-files/gstrip/W22_chr1-10.fasta \
     -I /home/hirschc1/pmonnaha/misc-files/gstrip/W22_E2_Bams.txt \
     -md /home/hirschc1/pmonnaha/misc-files/gstrip/W22_MetaData_E2-0 \
     -md /home/hirschc1/pmonnaha/misc-files/gstrip/W22_MetaData_E2-1 \
     -md /home/hirschc1/pmonnaha/misc-files/gstrip/W22_MetaData_E2-2 \
     -runDirectory /panfs/roc/scratch/pmonnaha/Maize/gstrip/w22 \
     -jobLogDir /panfs/roc/scratch/pmonnaha/Maize/gstrip/w22/logs \
     -jobRunner Drmaa \
     -gatkJobRunner Drmaa \
     -P depth.parityCorrectionThreshold:null \
     -tilingWindowSize 1000 \
     -tilingWindowOverlap 500 \
     -maximumReferenceGapLength 1000 \
     -boundaryPrecision 100 \
     -minimumRefinedLength 500 \
     -retry 10 \
     -memLimit 14 \
     -startFromScratch \
     -jobNative '-l walltime=24:00:00' \
     -run
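Since the per-stage summaries all claim 'There were no warn messages', it may help to grep every per-job log for exceptions directly rather than trusting the summary. A minimal sketch, assuming the log directory from the script above; the helper name is illustrative:

```shell
# List Queue job logs that contain a Java exception or an ERROR line.
find_failed_logs() {
    logdir="$1"
    # grep -l prints only the names of matching files;
    # 2>/dev/null suppresses noise if the glob matches nothing
    grep -l -E 'Exception|ERROR' "$logdir"/*.out 2>/dev/null
}
```

For example, `find_failed_logs /panfs/roc/scratch/pmonnaha/Maize/gstrip/w22/logs` would print only the logs worth reading in full.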

Direct download URL for GATK 3.8?


Hello,

I saw that GATK 3.8 now allows direct download without requiring registration. I was wondering whether you could provide a direct download URL for GATK 3.8. I am asking because I use GATK3 as one of the dependencies in a couple of software pipelines I am developing, and it would be great if my pipelines could handle the GATK 3.8 installation automatically instead of asking users to download it manually. Thanks for your consideration!

Best,
Jia-Xing
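For what it's worth, once a direct link exists, the install step can be scripted along these lines. The download URL is deliberately left as a caller-supplied value (I won't guess it here), and `fetch_gatk` is a hypothetical helper name:

```shell
# Sketch of a non-interactive GATK 3.8 install step for a pipeline.
# The URL must be supplied by the caller once the GATK team publishes it.
fetch_gatk() {
    url="$1"
    dest="${2:-gatk-3.8.tar.bz2}"
    # -f: fail on HTTP errors; -sS: quiet but show errors; -L: follow redirects
    curl -fsSL "$url" -o "$dest" || return 1
    tar -xjf "$dest"
}
```

A pipeline could then call `fetch_gatk "$GATK_URL"` during setup and fail fast if the download or unpack step breaks.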
