Some tools in GATK4, like the gCNV pipeline and the new deep learning variant filtering tools, require extensive Python dependencies. To avoid having to worry about managing these dependencies, we recommend using the GATK4 docker container, which comes with everything pre-installed, as explained here. If you are running GATK4 on a server and/or cannot use the Docker image, we recommend using the Conda package manager as a backup solution. The Conda package manager comes with all the dependencies you need, so you do not need to install everything separately. Both Conda and Docker are intended to solve the same problem, but one of the big differences/benefits of Conda is that you can use Conda without having root access. Conda should be easy to install if you follow these steps.
1) Refer to the installation instructions from Conda. Choose the correct version/computer you need to download it for. You will have the option of downloading Anaconda or Miniconda. Conda provides documentation about the difference between Anaconda and Miniconda. We chose to use Miniconda for this tutorial because we just wanted to use the GATK package and did not want to take up too much space on our computer. If you are not going to use Conda for anything other than GATK4, you might consider doing the same. If you choose to install Anaconda, you may have access to other bioinformatics packages that are helpful to you, and you won’t have to install each package you need. Follow the prompts to properly install the .pkg file. Make sure you choose the correct package for the version of Python you are using. For example, if you have Python 2.7 on your computer, choose the version specific to it.
2) Go to the directory where you have stored the GATK4 jars and the gatk
wrapper script, and make sure gatkcondaenv.yml is present. Run
conda env create -n gatk -f gatkcondaenv.yml
source activate gatk
3) To check if your Conda environment is running properly, type conda list
and you should see a list of packages installed.
gatkpythonpackages
should be one of them.
4) You can also test out whether the new variant filtering tool (CNNScoreVariants) runs properly. If you run
python -c "import vqsr_cnn"
the output should look like Using TensorFlow backend.
. If you do not have the Conda environment configured correctly, you will get an error immediately saying ImportError: No module named vqsr_cnn
.
5) If you later upgrade to a new version of GATK4, you will need to update the Conda configuration in the new GATK4 folder. If you simply overwrite the old GATK with the new one, you will get an error message saying “CondaValueError: prefix already exists: /anaconda2/envs/gatk”. For example, when I upgraded from GATK 4.0.1.2 to GATK 4.0.2.0, I simply ran (in my 4.0.2.0 folder)
source deactivate
conda env remove -n gatk
Then, follow Steps 2-4 again to re-install it.