7.1 Managing Software

Teaching: 60 min || Exercises: 20 min

Overview

Questions:
  • How do I install and manage software on my computer or a HPC?
Learning Objectives:
  • Understand what a package manager is, and how it can be used to manage software installation on a on a personal computer or HPC environment.
  • Install the Conda package manager.
  • Create a software environment and install software using Conda.
  • Understand that mamba is a faster, more efficient version of Conda
Keypoints:
  • To install your own software, you can use the Conda package manager.
  • Conda allows you to have separate “software environments”, where multiple package versions can co-exist on your system.
  • Use conda env create <ENV> to create a new software environment and conda install -n <ENV> <PROGRAM> to install a program in that environment.
  • Use conda activate <ENV> to “activate” the software environment and make all the programs installed there available.
  • Conda is a very useful tool but can be slow, so increasingly mamba is being used instead.

7.1 The conda Package Manager

Often you may want to use software packages that are not be installed by default on your computer or the compute cluster you may have access to. There are several ways you could manage your own software installation, but in this module we will be using Conda, which gives you access to a large number of scientific packages.

There are two main software distributions that you can download and install, called Anaconda and Miniconda.
Miniconda is a lighter version, which includes only base Python, while Anaconda is a much larger bundle of software that includes many other packages (see the Documentation Page for more information).

One of the strengths of using Conda to manage your software is that you can have different versions of your software installed alongside each other, organised in environments. Organising software packages into environments is extremely useful, as it allows to have a reproducible set of software versions that you can use and resuse in your projects.

Illustration of Conda environments.

Installing Conda (Optional)

To start with, let’s install Conda on your machine. In this course we will install the Miniconda bundle, as it’s lighter and faster to install:

  1. Make sure you’re logged in to the HPC and in the home directory (cd ~).
  2. Download the Miniconda installer by running:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
  1. Run the installation script just downloaded:
bash Miniconda3-latest-Linux-x86_64.sh
  1. Follow the installation instructions accepting default options (answering ‘yes’ to any questions)
  2. Run the following command:
conda config --add channels defaults; conda config --add channels bioconda; conda config --add channels conda-forge; conda config --set channel_priority strict

This adds two channels (sources of software) useful for bioinformatics and data science applications.

Tip

Anaconda and Miniconda are also available for Windows and Mac OS. See the Conda Installation Documents for instructions.

Installing Software Using Conda

The command used to install and manage software is called conda. Although we will only cover the basics in this course, it has an excellent documentation and a useful cheatsheet.

The first thing to do is to create a software environment for our project. Although this is optional (you could instead install everything in the “base” default environment), it is a good practice as it means the software versions remain stable within each project.

To create an environment we use:

conda create --name ENV

Where “ENV” is the name we want to give to that environment. Once the environment is created, we can install packages using:

conda install --name ENV PROGRAM

Where “PROGRAM” is the name of the software we want to install.

Tip

One way to organise your software environments is to create an environment for each kind of analysis that you might be doing regularly. For example, you could have an environment named bioinformatics with software that you use for analysing sequence data (e.g. fastQC, bwa) and another called processing with software you use for downstream analysis of your results (e.g. Python’s pandas). Alternatively, you can create a separate environment for each tool you’d like to install and use.

To search for the software packages that are available through conda:

  • go to anaconda.org.
  • in the search box search for a program of your choice. For example: “bowtie2”.
  • the results should be listed as CHANNEL/PROGRAM, where CHANNEL will the the source channel from where the software is available. Usually scientific/bioinformatics software is available through the conda-forge and bioconda channels.

If you need to install a program from a different channel than the defaults, you can specify it during the install command using the -c option. For example conda install --channel CHANNEL --name ENV PROGRAM.

Let’s see this with an example, where we create a new environment called “scipy” and install the python scientific packages:

conda create --name scipy
conda install --name scipy --channel conda-forge numpy matplotlib

To see all the environments you have available, you can use:

conda info --env
# conda environments:
#
base                  *  /home/participant36/miniconda3
scipy                    /home/participant36/miniconda3/envs/scipy

In our case it lists the base (default) environment and the newly created scipy environment. The asterisk (“*“) tells us which environment we’re using at the moment.

Loading Conda Environments

Once your packages are installed in an environment, you can load that environment by using conda activate ENV, where “ENV” is the name of your environment. For example, we can activate our previously created environment with:

conda activate scipy

If you check which python executable is being used now, you will notice it’s the one from this new environment:

which python
~/miniconda3/envs/scipy/bin/python

You can also check that the new environment is in use from:

conda env list
# conda environments:
#
base                     /home/participant36/miniconda3
scipy                 *  /home/participant36/miniconda3/envs/scipy

And notice that the asterisk “*” is now showing we’re using the scipy environment.

7.2 Replacing conda with Mamba

conda is an amazing tool which has completely changed the way we install tools, removing the worst of the hassle around ensuring that dependencies are installed alongside the tools. However, conda can be quite slow and sometimes has difficulties resolving the dependencies. Fortunately, Mamba was developed to speed up the process. Mamba is a reimplementation of the conda package manager in C++ and allows you to use exactly the same commands as conda (simply replace conda with mamba). Unlike, the tools we’ve been installing above in their own environments, Mamba needs to be installed in the conda base environment. Let’s try installing it now:

conda install mamba -n base -c conda-forge

Once it’s installed let’s try creating a new environment called bakta and then install the Bakta annotation tool we’re going to use in the Assembly and annotation module.

mamba create -n bakta 

mamba activate bakta

mamba install -c bioconda bakta

Let’s check that we successfully installed bakta:

bakta -h

If you get the bakta help output, congratulations you’ve successfully installed a tool with Mamba.

You can see the commands we use with Mamba are exactly the same as conda but the output looks very different. Don’t worry it’s doing exactly the same thing as conda but better!

Exercise 7.1.1: Create a conda environment

In the workshop_files_Bact_Genomics_2023/07_software_management folder, you will find the same files we used in the short read mapping module. Your objective will be to prepare to align the sequences to the MTB H37v reference genome, using a software called bowtie2.

First, we need to prepare our genome for this alignment procedure (this is referred to as indexing the genome).

  1. Create a new conda environment named “bowtie”.
  2. Install the bowtie2 program in your new environment.
  3. Check that the software installed correctly by running which bowtie2 and bowtie2 --help.
  4. Index the reference genome with bowtie2
  1. To create a new conda environment we run:
mamba create --name bowtie2
  1. If we search for this software on the Anaconda website, we will find that it is available via the bioconda channel: https://anaconda.org/bioconda/bowtie2

We can install it on our environment with:

mamba install --name bowtie2 --channel bioconda bowtie2
  1. First we need to activate our environment:
mamba activate bioinformatics

Then, if we run bowtie2 --help, we should get the software help printed on the console.

  1. Now let’s index our reference genome with bowtie2
bowtie2-build MTB_H37Rv.fasta index

We should get several output files in the 07_software_management directory with an extension “.bt2”:

ls 07_software_management
index.1.bt2
index.2.bt2
index.3.bt2
index.4.bt2
index.rev.1.bt2
index.rev.2.bt2

Further resources

Credit

Information on this page has been adapted and modified from the following source(s): https://github.com/cambiotraining/hpc-intro-sanger/blob/main/04-software.md