6.1 Assembly and Annotation

Teaching: 50 min || Exercises: 20 min

Overview

Questions:
  • What is de novo genome assembly?
  • How do I assemble my genome?
  • How do I assess the quality of my genome?
  • What is genome annotation?
  • How do I annotate my genome?
Learning Objectives:
  • Understand what de novo genome assembly is and how it differs from reference-based assembly.
  • Assemble sequence data with shovill
  • Generate metrics for your assembly with QUAST.
  • Annotate your genome with prokka
Keypoints:
  • Understand what de novo genome assembly is and how we can use shovill to create one.
  • Extract and interprete the assembly metrics using QUAST to decide how good your assembly is.
  • Understand what genome annotation is and how these can be added using prokka.

6.1 Background

There are two approaches for genome assembly: reference-based (or comparative) or de novo. In a reference-based assembly, we use a reference genome as a guide to map our sequence data to and thus reassemble our sequence this way (this is what we did in the previous module). Alternatively, we can create a ‘new’ (de novo) assembly that does not rely on a map or reference and more closely reflects the actual genome structure of the isolate that was sequenced.

Genome assembly

Genome assemblers

Several tools are available for de novo genome assembly depending on whether you’re trying to assemble short-read sequence data, long reads or else a combination of both. Two of the most commonly used assemblers for short-read Illumina data are Velvet and SPAdes. SPAdes has become the de facto standard de novo genome assembler for Illumina whole genome sequencing data of bacteria and is a major improvement over previous assemblers like Velvet. However, some of its components can be slow and it traditionally did not handle overlapping paired-end reads well. Shovill is a pipeline which uses SPAdes at its core, but alters the steps before and after the primary assembly step to get similar results in less time. Shovill also supports other assemblers like SKESA, Velvet and Megahit.

Disk Usage I — Before analysis

Before we start performing any assemblies, let’s pause and check the space of our current working directory as we did for our previous lesson.

You can do this with the disk usage du command

du -h
Current Disk Space In assembly_annotation_MTB Directory ~247MB
Now, keep this value in mind, and this time, don’t forget it. We will come back to it at the end of this chapter.

6.2 de novo genome assembly

de novo genome assembly with Shovill

We’re going to use Shovill to create our de novo genome assemblies. Let’s navigate the module directory and activate the shovill environment:


cd workshop_files_Bact_Genomics_2023/06_assembly_annotation 

mamba activate shovill
Help

Do this to get the help information for Shovill:

shovill -h
SYNOPSIS
  De novo assembly pipeline for Illumina paired reads
USAGE
  shovill [options] --outdir DIR --R1 R1.fq.gz --R2 R2.fq.gz
GENERAL
  --help          This help
  --version       Print version and exit
  --check         Check dependencies are installed
INPUT
  --R1 XXX        Read 1 FASTQ (default: '')
  --R2 XXX        Read 2 FASTQ (default: '')
  --depth N       Sub-sample --R1/--R2 to this depth. Disable with --depth 0 (default: 150)
  --gsize XXX     Estimated genome size eg. 3.2M <blank=AUTODETECT> (default: '')
OUTPUT
  --outdir XXX    Output folder (default: '')
  --force         Force overwite of existing output folder (default: OFF)
  --minlen N      Minimum contig length <0=AUTO> (default: 0)
  --mincov n.nn   Minimum contig coverage <0=AUTO> (default: 2)
  --namefmt XXX   Format of contig FASTA IDs in 'printf' style (default: 'contig%05d')
  --keepfiles     Keep intermediate files (default: OFF)
RESOURCES
  --tmpdir XXX    Fast temporary directory (default: '')
  --cpus N        Number of CPUs to use (0=ALL) (default: 8)
  --ram n.nn      Try to keep RAM usage below this many GB (default: 16)
ASSEMBLER
  --assembler XXX Assembler: megahit skesa velvet spades (default: 'spades')
  --opts XXX      Extra assembler options in quotes eg. spades: '--sc' (default: '')
  --kmers XXX     K-mers to use <blank=AUTO> (default: '')
MODULES
  --trim          Enable adaptor trimming (default: OFF)
  --noreadcorr    Disable read error correction (default: OFF)
  --nostitch      Disable read stitching (default: OFF)
  --nocorr        Disable post-assembly correction (default: OFF)
HOMEPAGE
  https://github.com/tseemann/shovill - Torsten Seemann

We’ve chosen one of the TB genomes from our course dataset to assemble with Shovill: ERX1275297_ERR1203055 (TBNm 2359).

Usage

The basic command for running Shovill is as follows:

shovill --outdir <OUTDIR> --R1 <FWD FASTQ> --R2 <REV FASTQ> --gsize <GENOME SIZE>

The meaning of the options used is:

Input option Input required Description
–outdir DIRECTORY directory to write the output files to
–R1 FILE Read 1 FASTQ file
–R2 FILE Read 2 FASTQ file
–gsize TXT The estimated genome size that shovill will use to calculate read coverage
Run shovill

Let’s build our first de novo assembly with shovill:

shovill --outdir ERX1275297_ERR1203055 --R1 ERX1275297_ERR1203055_1.fastq.gz --R2 ERX1275297_ERR1203055_2.fastq.gz --gsize 4.3M
[shovill] Hello ajv37
[shovill] You ran: /home/ajv37/miniconda3/envs/shovill/bin/shovill --outdir ERX1275297_ERR1203055 --R1 ERX1275297_ERR1203055_1.fastq.gz --R2 ERX1275297_ERR1203055_2.fastq.gz --gsize 4.3M
[shovill] This is shovill 1.1.0
[shovill] Written by Torsten Seemann
[shovill] Homepage is https://github.com/tseemann/shovill
[shovill] Operating system is linux
[shovill] Perl version is v5.32.1
[shovill] Machine has 64 CPU cores and 187.36 GB RAM
[shovill] Using bwa - /home/ajv37/miniconda3/envs/shovill/bin/bwa | Version: 0.7.17-r1188
[shovill] Using flash - /home/ajv37/miniconda3/envs/shovill/bin/flash | FLASH v1.2.11
[shovill] Using java - /home/ajv37/miniconda3/envs/shovill/bin/java | openjdk version "11.0.8-internal" 2020-07-14
[shovill] Using kmc - /home/ajv37/miniconda3/envs/shovill/bin/kmc | K-Mer Counter (KMC) ver. 3.2.1 (2022-01-04)
[shovill] Using lighter - /home/ajv37/miniconda3/envs/shovill/bin/lighter | Lighter v1.1.2
[shovill] Using megahit - /home/ajv37/miniconda3/envs/shovill/bin/megahit | MEGAHIT v1.2.9
[shovill] Using megahit_toolkit - /home/ajv37/miniconda3/envs/shovill/bin/megahit_toolkit | v1.2.9
[shovill] Using pigz - /home/ajv37/miniconda3/envs/shovill/bin/pigz | pigz 2.6
[shovill] Using pilon - /home/ajv37/miniconda3/envs/shovill/bin/pilon | Pilon version 1.24 Thu Jan 28 13:00:45 2021 -0500
[shovill] Using samclip - /home/ajv37/miniconda3/envs/shovill/bin/samclip | samclip 0.4.0
[shovill] Using samtools - /home/ajv37/miniconda3/envs/shovill/bin/samtools | Version: 1.16.1 (using htslib 1.16)
[shovill] Using seqtk - /home/ajv37/miniconda3/envs/shovill/bin/seqtk | Version: 1.3-r106
[shovill] Using skesa - /home/ajv37/miniconda3/envs/shovill/bin/skesa | SKESA 2.4.0
...

The following files will be created in the ERX1275297_ERR1203055 directory:

Output file Description
contigs.fa The final assembly you should use
shovill.log Full log file for bug reporting
shovill.corrections List of post-assembly corrections
contigs.gfa Assembly graph
spades.fasta Raw assembled contigs

Before we move on to the next step, let’s rename our assembly (contigs.fa) to something more useful:

cd ERX1275297_ERR1203055

mv contigs.fa ERX1275297_ERR1203055_contigs.fa

Finally, deactivate the shovill environment:

mamba deactivate

6.3 Assembly quality assessment

Assembly QC with QUAST

Before we do any further analyses with our assemblies, we need to assess the quality. To do this we use a tool called QUAST which generates a number of useful metrics for assemblies. These include but aren’t limited to:

  1. The total number of contigs greater than 0 bp. Ideally, we like to see assemblies in the smallest number of contigs (‘pieces’) as this means there is likely less missing data contained in gaps between contigs.
  2. The total length of the assembly. We normally know what the expected length of our assembly is (for more diverse organisms like E. coli there may be a range of genome sizes). The length of your assembly should be close to the expected genome size. If it’s too big or too small, either there is some kind of contamination or else you’ve sequenced the wrong species!
  3. The N50 of the assembly. This is the final metric often used to assess the quality of an assembly. The N50 is typically calculated by averaging the length of the largest contigs that account for 50% of the genome size. This is a bit more complicated conceptually but the higher the N50 the better. A large N50 implies that you have a small number of larger contigs in your assembly which equals a good assembly.

Let’s activate the quast environment:

mamba activate quast
Help

Do this to get the help information for QUAST:

quast -h
QUAST: Quality Assessment Tool for Genome Assemblies
Version: 5.0.2

Usage: python /home/ajv37/.conda/envs/quast/bin/quast [options] <files_with_contigs>

Options:
-o  --output-dir  <dirname>       Directory to store all result files [default: quast_results/results_<datetime>]
-r                <filename>      Reference genome file
-g  --features [type:]<filename>  File with genomic feature coordinates in the reference (GFF, BED, NCBI or TXT)
                                  Optional 'type' can be specified for extracting only a specific feature type from GFF
-m  --min-contig  <int>           Lower threshold for contig length [default: 500]
-t  --threads     <int>           Maximum number of threads [default: 25% of CPUs]

Advanced options:
-s  --split-scaffolds                 Split assemblies by continuous fragments of N's and add such "contigs" to the comparison
-l  --labels "label, label, ..."      Names of assemblies to use in reports, comma-separated. If contain spaces, use quotes
-L                                    Take assembly names from their parent directory names
-e  --eukaryote                       Genome is eukaryotic (primarily affects gene prediction)
    --fungus                          Genome is fungal (primarily affects gene prediction)
    --large                           Use optimal parameters for evaluation of large genomes
                                      In particular, imposes '-e -m 3000 -i 500 -x 7000' (can be overridden manually)
-k  --k-mer-stats                     Compute k-mer-based quality metrics (recommended for large genomes)
                                      This may significantly increase memory and time consumption on large genomes
    --k-mer-size                      Size of k used in --k-mer-stats [default: 101]
    --circos                          Draw Circos plot
-f  --gene-finding                    Predict genes using GeneMarkS (prokaryotes, default) or GeneMark-ES (eukaryotes, use --eukaryote)
    --mgm                             Use MetaGeneMark for gene prediction (instead of the default finder, see above)
    --glimmer                         Use GlimmerHMM for gene prediction (instead of the default finder, see above)
    --gene-thresholds <int,int,...>   Comma-separated list of threshold lengths of genes to search with Gene Finding module
                                      [default: 0,300,1500,3000]
    --rna-finding                     Predict ribosomal RNA genes using Barrnap
-b  --conserved-genes-finding         Count conserved orthologs using BUSCO (only on Linux)
    --operons  <filename>             File with operon coordinates in the reference (GFF, BED, NCBI or TXT)
    --est-ref-size <int>              Estimated reference size (for computing NGx metrics without a reference)
    --contig-thresholds <int,int,...> Comma-separated list of contig length thresholds [default: 0,1000,5000,10000,25000,50000]
-u  --use-all-alignments              Compute genome fraction, # genes, # operons in QUAST v1.* style.
                                      By default, QUAST filters Minimap's alignments to keep only best ones
-i  --min-alignment <int>             The minimum alignment length [default: 65]
    --min-identity <float>            The minimum alignment identity (80.0, 100.0) [default: 95.0]
-a  --ambiguity-usage <none|one|all>  Use none, one, or all alignments of a contig when all of them
                                      are almost equally good (see --ambiguity-score) [default: one]
    --ambiguity-score <float>         Score S for defining equally good alignments of a single contig. All alignments are sorted 
                                      by decreasing LEN * IDY% value. All alignments with LEN * IDY% < S * best(LEN * IDY%) are 
                                      discarded. S should be between 0.8 and 1.0 [default: 0.99]
    --strict-NA                       Break contigs in any misassembly event when compute NAx and NGAx.
                                      By default, QUAST breaks contigs only by extensive misassemblies (not local ones)
-x  --extensive-mis-size  <int>       Lower threshold for extensive misassembly size. All relocations with inconsistency
                                      less than extensive-mis-size are counted as local misassemblies [default: 1000]
    --scaffold-gap-max-size  <int>    Max allowed scaffold gap length difference. All relocations with inconsistency
                                      less than scaffold-gap-size are counted as scaffold gap misassemblies [default: 10000]
    --unaligned-part-size  <int>      Lower threshold for detecting partially unaligned contigs. Such contig should have
                                      at least one unaligned fragment >= the threshold [default: 500]
    --skip-unaligned-mis-contigs      Do not distinguish contigs with >= 50% unaligned bases as a separate group
                                      By default, QUAST does not count misassemblies in them
    --fragmented                      Reference genome may be fragmented into small pieces (e.g. scaffolded reference) 
    --fragmented-max-indent  <int>    Mark translocation as fake if both alignments are located no further than N bases 
                                      from the ends of the reference fragments [default: 85]
                                      Requires --fragmented option
    --upper-bound-assembly            Simulate upper bound assembly based on the reference genome and reads
    --upper-bound-min-con  <int>      Minimal number of 'connecting reads' needed for joining upper bound contigs into a scaffold
                                      [default: 2 for mate-pairs and 1 for long reads]
    --est-insert-size  <int>          Use provided insert size in upper bound assembly simulation [default: auto detect from reads or 255]
    --plots-format  <str>             Save plots in specified format [default: pdf].
                                      Supported formats: emf, eps, pdf, png, ps, raw, rgba, svg, svgz
    --memory-efficient                Run everything using one thread, separately per each assembly.
                                      This may significantly reduce memory consumption on large genomes
    --space-efficient                 Create only reports and plots files. Aux files including .stdout, .stderr, .coords will not be created.
                                      This may significantly reduce space consumption on large genomes. Icarus viewers also will not be built
-1  --pe1     <filename>              File with forward paired-end reads (in FASTQ format, may be gzipped)
-2  --pe2     <filename>              File with reverse paired-end reads (in FASTQ format, may be gzipped)
    --pe12    <filename>              File with interlaced forward and reverse paired-end reads. (in FASTQ format, may be gzipped)
    --mp1     <filename>              File with forward mate-pair reads (in FASTQ format, may be gzipped)
    --mp2     <filename>              File with reverse mate-pair reads (in FASTQ format, may be gzipped)
    --mp12    <filename>              File with interlaced forward and reverse mate-pair reads (in FASTQ format, may be gzipped)
    --single  <filename>              File with unpaired short reads (in FASTQ format, may be gzipped)
    --pacbio     <filename>           File with PacBio reads (in FASTQ format, may be gzipped)
    --nanopore   <filename>           File with Oxford Nanopore reads (in FASTQ format, may be gzipped)
    --ref-sam <filename>              SAM alignment file obtained by aligning reads to reference genome file
    --ref-bam <filename>              BAM alignment file obtained by aligning reads to reference genome file
    --sam     <filename,filename,...> Comma-separated list of SAM alignment files obtained by aligning reads to assemblies
                                      (use the same order as for files with contigs)
    --bam     <filename,filename,...> Comma-separated list of BAM alignment files obtained by aligning reads to assemblies
                                      (use the same order as for files with contigs)
                                      Reads (or SAM/BAM file) are used for structural variation detection and
                                      coverage histogram building in Icarus
    --sv-bedpe  <filename>            File with structural variations (in BEDPE format)

Speedup options:
    --no-check                        Do not check and correct input fasta files. Use at your own risk (see manual)
    --no-plots                        Do not draw plots
    --no-html                         Do not build html reports and Icarus viewers
    --no-icarus                       Do not build Icarus viewers
    --no-snps                         Do not report SNPs (may significantly reduce memory consumption on large genomes)
    --no-gc                           Do not compute GC% and GC-distribution
    --no-sv                           Do not run structural variation detection (make sense only if reads are specified)
    --no-gzip                         Do not compress large output files
    --no-read-stats                   Do not align reads to assemblies
                                      Reads will be aligned to reference and used for coverage analysis,
                                      upper bound assembly simulation, and structural variation detection.
                                      Use this option if you do not need read statistics for assemblies.
    --fast                            A combination of all speedup options except --no-check

Other:
    --silent                          Do not print detailed information about each step to stdout (log file is not affected)
    --test                            Run QUAST on the data from the test_data folder, output to quast_test_output
    --test-sv                         Run QUAST with structural variants detection on the data from the test_data folder,
                                      output to quast_test_output
-h  --help                            Print full usage message
-v  --version                         Print version

Online QUAST manual is available at http://quast.sf.net/manual
Usage

The basic command for running QUAST is as follows:

quast.py --output-dir <OUTDIR> <ASSEMBLY>  

The meaning of the option used is:

Input option Input required Description
–output-dir DIRECTORY Directory to write the output files to
Run QUAST

Let’s run QUAST on the genome we’ve just created to assess the quality:

quast.py --output-dir quast ERX1275297_ERR1203055_contigs.fa
Version: 5.0.2

System information:
  OS: Linux-3.10.0-1160.80.1.el7.x86_64-x86_64-with-redhat-7.9-Nitrogen (linux_64)
  Python version: 3.7.12
  CPUs number: 64

Started: 2022-12-15 16:44:12

Logging to /rds/project/rds-PzYD5LltalA/Teaching/Ghana/shovill/ERX1275297_ERR1203055.1/quast/quast.log
NOTICE: Maximum number of threads is set to 16 (use --threads option to set it manually)

CWD: /rds/project/rds-PzYD5LltalA/Teaching/Ghana/shovill/ERX1275297_ERR1203055.1
Main parameters: 
  MODE: default, threads: 16, minimum contig length: 500, minimum alignment length: 65, \
  ambiguity: one, threshold for extensive misassembly size: 1000
...

The following files will be created in the quast directory:

Output file Description
report.txt Assessment summary in plain text format
report.tsv Tab-separated version of the summary, suitable for spreadsheets (Google Docs, Excel, etc)
report.tex LaTeX version of the summary
icarus.html Icarus main menu with links to interactive viewers
report.pdf All other plots combined with all tables (file is created if matplotlib python library is installed)
report.html HTML version of the report with interactive plots inside
transposed_report.txt Transposed assessment summary in plain text format
transposed_report.tsv Transposed tab-separated version of the summary
transposed_report.tex Transposed LaTeX version of the summary
Exercise 6.1.1: Investigate the QUAST output

Now that we’ve got the results from QUAST, how good is the assembly we created with shovill?

  1. What is the total number of contigs?
  2. What is the total length of the assembly?
  3. What is the N50 of the assembly?

There are a couple of ways we could answer the questions above as the assembly metrics can be parsed in different ways. Let’s parse the report.tsv:

cat report.tsv

This will provide all the information we need:

Assembly    ERX1275297_ERR1203055_contigs
# contigs (>= 0 bp) 227
# contigs (>= 1000 bp)  86
# contigs (>= 5000 bp)  61
# contigs (>= 10000 bp) 55
# contigs (>= 25000 bp) 48
# contigs (>= 50000 bp) 35
Total length (>= 0 bp)  4397844
Total length (>= 1000 bp)   4361978
Total length (>= 5000 bp)   4311192
Total length (>= 10000 bp)  4270901
Total length (>= 25000 bp)  4162429
Total length (>= 50000 bp)  3732349
# contigs   105
Largest contig  298563
Total length    4376331
GC (%)  65.58
N50 99158
N75 64010
L50 15
L75 28
# N's per 100 kbp   0.00
  1. The total number of contigs > 0bp is 227
  2. The total length of the assembly is 4397844
  3. The N50 of the assembly is 99158

What do you think? Should we move forward with this assembly? The answer is yes!

Finally, deactivate the quast environment:

mamba deactivate

6.4 Genome annotation

Genome annotation is a multi-level process that includes prediction of protein-coding genes (CDSs), as well as other functional genome units such as structural RNAs, tRNAs, small RNAs, pseudogenes, control regions, direct and inverted repeats, insertion sequences, transposons and other mobile elements. The most commonly used tools for annotating bacterial genomes are Prokka and, more recently, Bakta. Both use a tool called prodigal to predict the protein-coding regions along with other tools for predicting other genomic features such as Aragorn for tRNA. Once the genomic regions have been predicted, the tools use a database of existing bacterial genome annotations, normally generated from a large collection of genomes such as UniRef, to add this information to your genome.

Annotation with Prokka

Prokka is a software tool to annotate bacterial, archaeal and viral genomes quickly and produce standards-compliant output files (GenBank, EMBL and gff) for further analysis or viewing in genome browsers. It is suitable for annotating de novo assemblies of bacteria, but not appropriate for human genomes (or any other eukaryote).

Before we run Prokka, let’s activate the environment:

mamba activate prokka
Help

Do this to get the help information for Prokka:

prokka --help
Name:
  Prokka 1.14.6 by Torsten Seemann <torsten.seemann@gmail.com>
Synopsis:
  rapid bacterial genome annotation
Usage:
  prokka [options] <contigs.fasta>
General:
  --help             This help
  --version          Print version and exit
  --citation         Print citation for referencing Prokka
  --quiet            No screen output (default OFF)
  --debug            Debug mode: keep all temporary files (default OFF)
Setup:
  --dbdir [X]        Prokka database root folders (default '/home/ajv37/miniconda3/envs/prokka2/db')
  --listdb           List all configured databases
  --setupdb          Index all installed databases
  --cleandb          Remove all database indices
  --depends          List all software dependencies
Outputs:
  --outdir [X]       Output folder [auto] (default '')
  --force            Force overwriting existing output folder (default OFF)
  --prefix [X]       Filename output prefix [auto] (default '')
  --addgenes         Add 'gene' features for each 'CDS' feature (default OFF)
  --addmrna          Add 'mRNA' features for each 'CDS' feature (default OFF)
  --locustag [X]     Locus tag prefix [auto] (default '')
  --increment [N]    Locus tag counter increment (default '1')
  --gffver [N]       GFF version (default '3')
  --compliant        Force Genbank/ENA/DDJB compliance: --addgenes --mincontiglen 200 --centre XXX (default OFF)
  --centre [X]       Sequencing centre ID. (default '')
  --accver [N]       Version to put in Genbank file (default '1')
Organism details:
  --genus [X]        Genus name (default 'Genus')
  --species [X]      Species name (default 'species')
  --strain [X]       Strain name (default 'strain')
  --plasmid [X]      Plasmid name or identifier (default '')
Annotations:
  --kingdom [X]      Annotation mode: Archaea|Bacteria|Mitochondria|Viruses (default 'Bacteria')
  --gcode [N]        Genetic code / Translation table (set if --kingdom is set) (default '0')
  --prodigaltf [X]   Prodigal training file (default '')
  --gram [X]         Gram: -/neg +/pos (default '')
  --usegenus         Use genus-specific BLAST databases (needs --genus) (default OFF)
  --proteins [X]     FASTA or GBK file to use as 1st priority (default '')
  --hmms [X]         Trusted HMM to first annotate from (default '')
  --metagenome       Improve gene predictions for highly fragmented genomes (default OFF)
  --rawproduct       Do not clean up /product annotation (default OFF)
  --cdsrnaolap       Allow [tr]RNA to overlap CDS (default OFF)
Matching:
  --evalue [n.n]     Similarity e-value cut-off (default '1e-09')
  --coverage [n.n]   Minimum coverage on query protein (default '80')
Computation:
  --cpus [N]         Number of CPUs to use [0=all] (default '8')
  --fast             Fast mode - only use basic BLASTP databases (default OFF)
  --noanno           For CDS just set /product="unannotated protein" (default OFF)
  --mincontiglen [N] Minimum contig size [NCBI needs 200] (default '1')
  --rfam             Enable searching for ncRNAs with Infernal+Rfam (SLOW!) (default '0')
  --norrna           Don't run rRNA search (default OFF)
  --notrna           Don't run tRNA search (default OFF)
  --rnammer          Prefer RNAmmer over Barrnap for rRNA prediction (default OFF)
Usage

The basic command for running Prokka is as follows:

prokka --outdir <OUTPUT DIRECTORY> --prefix <OUTPUT PREFIX> <ASSEMBLY>

The meaning of the options used is:

Input option Input required Description
–outdir DIRECTORY Name of output directory
–prefix PREFIX Prefix for output files
Run Prokka

Let’s run Prokka on the genome we created earlier to do produce annotation files for this sample:

prokka --prefix ERX1275297_ERR1203055 --outdir ERX1275297_ERR1203055 ERX1275297_ERR1203055_contigs.fa
[18:38:38] This is prokka 1.14.6
[18:38:38] Written by Torsten Seemann <torsten.seemann@gmail.com>
[18:38:38] Homepage is https://github.com/tseemann/prokka
[18:38:38] Local time is Tue Jan 24 18:38:38 2023
[18:38:38] You are ajv37
[18:38:38] Operating system is linux
[18:38:38] You have BioPerl 1.7.8
Argument "1.7.8" isn't numeric in numeric lt (<) at /home/ajv37/miniconda3/envs/prokka2/bin/prokka line 259.
[18:38:38] System has 64 cores.
[18:38:38] Will use maximum of 8 cores.
...

The following files will be created:

Output file Description
ERX1275297_ERR1203055.tsv Tab-separated file of all features: locus_tag,ftype,len_bp,gene,EC_number,COG,product
ERX1275297_ERR1203055.gff This is the master annotation in GFF3 format, containing both sequences and annotations. It can be viewed directly in Artemis or IGV.
ERX1275297_ERR1203055.gbk This is a standard Genbank file derived from the master .gff. If the input to prokka was a multi-FASTA, then this will be a multi-Genbank, with one record for each sequence.
ERX1275297_ERR1203055.sqn An ASN1 format “Sequin” file for submission to Genbank. It needs to be edited to set the correct taxonomy, authors, related publication etc.
ERX1275297_ERR1203055.fna Nucleotide FASTA file of the input contig sequences.
ERX1275297_ERR1203055.ffn Nucleotide FASTA file of all the prediction transcripts (CDS, rRNA, tRNA, tmRNA, misc_RNA)
ERX1275297_ERR1203055.faa Protein FASTA file of the translated CDS sequences.
ERX1275297_ERR1203055.fsa Nucleotide FASTA file of the input contig sequences, used by “tbl2asn” to create the .sqn file. It is mostly the same as the .fna file, but with extra Sequin tags in the sequence description lines.
ERX1275297_ERR1203055.tbl Feature Table file, used by “tbl2asn” to create the .sqn file.
ERX1275297_ERR1203055.txt Statistics relating to the annotated features found.
ERX1275297_ERR1203055.log Contains all the output that Prokka produced during its run. This is a record of what settings you used, even if the –quiet option was enabled.
ERX1275297_ERR1203055.err Unacceptable annotations - the NCBI discrepancy report.

The file format most commonly used for downstream analyses such as manual curation or pan-genome analysis is the gff file. We covered gff files back in Common File Formats so you may want to remind yourself what’s inside them.

Exercise 6.1.2: Investigate the Prokka output

We can extract summary statistics for Prokka by examining the output files.

  1. How many coding regions (CDS) did Prokka predict?
  2. How many rRNA genes were identified?
  3. How many repeat regions were identified?

The easiest way to pull out the summary statistics is to examine the ERX1275297_ERR1203055.txt file:

cat ERX1275297_ERR1203055.txt
organism: Genus species strain 
contigs: 227
bases: 4397844
CDS: 4076
rRNA: 3
repeat_region: 3
tRNA: 52
tmRNA: 1
  1. Prokka predicted 4076 CDSs in ERX1275297_ERR1203055
  2. 3 rRNA genes were identified
  3. 3 repeat_regions were identified

Finally, deactivate the prokka environment:

mamba deactivate
Disk Usage II — Cleaning up after analysis

Now that we are done investigating our assembling and annotating our genome, let’s pause again and check the space of our current working directory.

You can do this with the disk usage du command

du -h

How much disk space have you used since the start of the analysis?

Credit

Information on this page has been adapted and modified from the following source(s): https://github.com/Joseph7e/MDIBL-T3-WGS-Tutorial#genome-assembly