9.1 The nf-core/bactmap pipeline

Teaching: 60 min || Exercises: 30 min

Overview

Questions:

What is nf-core/bactmap?
How can I produce a multiple sequence alignment of TB sequences?

Learning Objectives:

Use nf-core/bactmap to produce a multiple sequence alignment.

Keypoints:

To build a phylogenetic tree we need a multiple sequence alignment of the sequences we want to infer a tree from.
Alignments are usually done against a reference genome.
We can use the pipeline nf-core/bactmap to produce a multiple sequence alignment.

9.1 nf-core/bactmap

nf-core/bactmap is a bioinformatics best-practice analysis pipeline for mapping short reads from bacterial WGS to a reference sequence, creating filtered VCF files, making pseudogenomes based on high quality positions in the VCF files and optionally creating a phylogeny from an alignment of the pseudogenomes. It makes use of many of the tools and techniques we’ve covered so far this week.

nf-core/bactmap variant calling pipeline diagram from nf-core (https://nf-co.re/bactmap).

9.2 TB dataset

We will be analysing a dataset of 83 TB genomes previously analysed by Prince as part of his PhD. As we don’t have access to the Noguchi HPC yet, we’re only going to analyse two of the genomes for now. Don’t worry, we’ll go through the results for all 83 samples

9.3 Activate the `nextflow` environment

Now navigate into the 09-bactmap/ directory and activate the nextflow environment:

cd ~/Desktop/workshop_files_Bact_Genomics_2023/09-bactmap/

mamba activate nextflow

Disk Usage I — Before analysis

Before we run nf-core/bactmap, let’s pause and check the space of our current working directory as we did for our previous lesson.

You can do this with the disk usage du command

du -h

Current Disk Space In 09-bactmap Directory

~247MB

Now, keep this value in mind, and this time, don’t forget it. We will come back to it at the end of this chapter.

9.4 The MTBC ancestral sequence

The reference sequence we’ll use to map our TB samples back to is one commonly used in the MTBC community and is known as the MTBC ancestral sequence. It was generated as part of Comas et al. 2013 (link to paper). This sequence was created by building an alignment and phylogenetic tree of the known MTBC genomes at the time and inferring what the most likely sequence of the ancestor of the MTBC was. It is a useful sequence to use when comparing different members of the MTBC as it should reduce the amount of bias towards any particular reference that is more closely related to one lineage than another.

9.5 Setting up the Nextflow config file

As we learnt in the nf-core pipelines module, we can create a config file that tells Nextflow which parameters to use e.g. the job submission system. We’ve create a basic template that will allow you to run nf-core/bactmap on the teaching computers/HPC. The config file looks like this:

params {
  config_profile_description = 'Noguchi HPC profile.'
  config_profile_contact = 'Prince Asare (PAsare@noguchi.ug.edu.gh)'
  config_profile_url = ""
}

process {
  executor = 'local'
}

conda {
  enabled = true
  conda.useMamba = true
}

params {
  max_memory = 7.GB
  max_cpus = 4
  max_time = 12.h
}

9.6 Run nf-core/bactmap to generate an alignment

The first step in building a phylogenetic tree is to generate a multiple sequence alignment. To do this, we’re going to map the sequence data for our 83 TB genomes to a reference, in this case the MTBC inferred ancestral sequence, using the nf-core/bactmap pipeline.

Generate a samplesheet

Now create a samplesheet.csv file containing the sample IDs and the location of the files to be mapped:

wget -L https://raw.githubusercontent.com/avantonder/bovisanalyzer/main/bin/fastq_dir_to_samplesheet.py

python fastq_dir_to_samplesheet.py \
    data \
    samplesheet.csv \
    -r1 _1.fastq.gz \
    -r2 _2.fastq.gz

The meaning of the options used is:

Input option	Input required	Description
`data`	DIRECTORY	Directory containing the fastq files
`samplesheet.csv`	FILENAME	Name of the input file for `nf-core/bactmap`
`-r1`	FILE SUFFIX	Suffix of the Read 1 FASTQ file
`-r2`	FILE SUFFIX	Suffix of the Read 2 FASTQ file*

* The fastq_dir_to_samplesheet.py script will use everything before this to create the sample IDs

As we don’t want to run all the samples today, we’ll extract the top two samples from the samplesheet.csv file:

head -n 3 samplesheet.csv > samplesheet_test.csv

Run nf-core/bactmap

Now run nf-core/bactmap with Nextflow to generate a reference-based alignment:

nextflow run nf-core/bactmap \
    -profile conda \
    -c teaching.conf \
    --input samplesheet_test.csv \
    --reference reference/MTB_ANC_unwrapped.fasta \
    --genome_size 4.3mb \
    --outdir results/bactmap

nf-core/bactmap has a number of optional arguments but for now these are the ones we’re going to use:

Input option	Description
`-profile conda`	Create `conda` environments for each tool in the pipeline (we could also use `Singularity` or `Docker`)
`-c teaching.conf`	A confuguration file containing the settings needed to run the pipeline on your machines
`--input samplesheet_test.csv`	The samplesheet we created above containing the sample ids and location of the fastq files
`--reference reference/MTB_ANC_unwrapped.fasta`	The reference sequence we’re going to map our samples to
`--genome_size 4.3mb`	This is used by the pipeline to calculate the approximate genome coverage in the fastq files. By default the pipeline uses a tool called `rasusa` to subsample the fastq files so the genome coverage is <= 100X
`--outdir results/bactmap`	The directory we’re going to save the outputs from `nf-core/bactmap` to

Visit the nf-core/bactmap page for further information on running the pipeline with different options.

If the pipeline has started successfully, you should see something like this:

N E X T F L O W  ~  version 22.04.3
Launching `https://github.com/nf-core/bactmap` [special_kare] DSL2 - revision: e83f8c5f0e [master]


------------------------------------------------------
                                        ,--./,-.
        ___     __   __   __   ___     /,-._.--~'
  |\ | |__  __ /  ` /  \ |__) |__         }  {
  | \| |       \__, \__/ |  \ |___     \`-._,-`-,
                                        `._,._,'
  nf-core/bactmap v1.0.0
------------------------------------------------------
Core Nextflow options
  revision                  : master
  runName                   : special_kare
  containerEngine           : singularity
  launchDir                 : /rds/project/rds-PzYD5LltalA/Teaching/Ghana
  workDir                   : /rds/project/rds-PzYD5LltalA/Teaching/Ghana/work
  projectDir                : /home/ajv37/.nextflow/assets/nf-core/bactmap
  userName                  : ajv37
  profile                   : singularity
  configFiles               : /home/ajv37/.nextflow/assets/nf-core/bactmap/nextflow.config, /rds/project/rds-PzYD5LltalA/Teaching/Ghana/cambridge.config

Input/output options
  input                     : samplesheet.csv
  outdir                    : bactmap_results

Compulsory parameters
  reference                 : MTB_ANC_unwrapped.fasta

Optional pipeline steps
  adapter_file              : /home/ajv37/.nextflow/assets/nf-core/bactmap/assets/adapters.fas
  genome_size               : 4.3mb

Max job request options
  max_cpus                  : 56
  max_memory                : 192 GB
  max_time                  : 12h

Institutional config options
  config_profile_description: Cambridge HPC cluster profile.
  config_profile_contact    : Andries van Tonder (ajv37@cam.ac.uk)
  config_profile_url        : https://docs.hpc.cam.ac.uk/hpc

!! Only displaying parameters that differ from the pipeline defaults !!
------------------------------------------------------
If you use nf-core/bactmap for your analysis please cite:

* The nf-core framework
  https://doi.org/10.1038/s41587-020-0439-x

* Software dependencies
  https://github.com/nf-core/bactmap/blob/master/CITATIONS.md
------------------------------------------------------
executor >  slurm (2)
executor >  slurm (2)
[3f/931dc1] process > NFCORE_BACTMAP:BACTMAP:INPUT_CHECK:SAMPLESHEET_CHECK (samplesheet.csv)   [  0%] 0 of 1
[59/74ef4b] process > NFCORE_BACTMAP:BACTMAP:BWA_INDEX (MTB_ANC_unwrapped.fasta)               [  0%] 0 of 1
[-        ] process > NFCORE_BACTMAP:BACTMAP:FASTP                                             -
[-        ] process > NFCORE_BACTMAP:BACTMAP:SUB_SAMPLING:RASUSA                               -
[-        ] process > NFCORE_BACTMAP:BACTMAP:BWA_MEM                                           -
[-        ] process > NFCORE_BACTMAP:BACTMAP:BAM_SORT_SAMTOOLS:SAMTOOLS_SORT                   -
[-        ] process > NFCORE_BACTMAP:BACTMAP:BAM_SORT_SAMTOOLS:SAMTOOLS_INDEX                  -
[-        ] process > NFCORE_BACTMAP:BACTMAP:BAM_SORT_SAMTOOLS:BAM_STATS_SAMTOOLS:SAMTOOLS_... -
[-        ] process > NFCORE_BACTMAP:BACTMAP:BAM_SORT_SAMTOOLS:BAM_STATS_SAMTOOLS:SAMTOOLS_... -
[-        ] process > NFCORE_BACTMAP:BACTMAP:BAM_SORT_SAMTOOLS:BAM_STATS_SAMTOOLS:SAMTOOLS_... -
[-        ] process > NFCORE_BACTMAP:BACTMAP:VARIANTS_BCFTOOLS:BCFTOOLS_MPILEUP                -
[-        ] process > NFCORE_BACTMAP:BACTMAP:VARIANTS_BCFTOOLS:BCFTOOLS_FILTER                 -
[-        ] process > NFCORE_BACTMAP:BACTMAP:VCF2PSEUDOGENOME                                  -
[-        ] process > NFCORE_BACTMAP:BACTMAP:ALIGNPSEUDOGENOMES                                -
[-        ] process > NFCORE_BACTMAP:BACTMAP:SNPSITES                                          -
[-        ] process > NFCORE_BACTMAP:BACTMAP:GET_SOFTWARE_VERSIONS                             -
[-        ] process > NFCORE_BACTMAP:BACTMAP:MULTIQC                                           -

The pipeline will take a while to run so we’ll have a look at the results after lunch.