3.1 Common File Formats
Teaching: 20 min || Exercises: 10 min
Overview
3.1.1 Background
In this chapter, our aim will be to go over the most common bioinformatics file formats you will come across. We will however not go into much details, but to simply help you become familiarized with the different file types that you may encounter in the course and are frequently used to store biological data.
The heading of each file format links to a wiki page with more details about the format.
Generally, files can be classified into two categories: text files and binary files.
Text files can be opened with standard text editors, and manipulated using command-line tools (such as
head
,less
,grep
,cat
, etc.). However, many of the standard files listed in this page can be opened with specific software that displays their content in a more user-friendly way. For example, the NEWICK format is used to store phylogenetic trees and, although it can be opened in a text editor, it is better used with a software such as FigTree to visualise the tree as a graph. Very often, text files may be compressed to save storage space. A common compression format used in bioinformatics is gzip which has extension.gz
. Many bioinformatic tools support compressed files. For example, FASTQ files (used to store NGS sequencing data) are often compressed with format.fq.gz
.Binary files are often used to store data more efficiently. This helps improves efficiency for computers and are usually in a non-human readable binary format. You’ll see some binary files have a corresponding “index” file which is useful for searching. Typically, specific tools need to be used with those files. For example, the BAM format is used to store sequences aligned to a reference genome and can be manipulated with dedicated software such as
samtools
.
More generally, the extensions
(last set of letters after the final .
in a file name) your find in file types are also used to indicate which algorithm/program can be used to open the file.
3.1.2 CSV and TSV
CSV stands for Comma-Separated Values
TSV on the other hand stands for Tab-Separated Values
CSV and TSV are text files. They store tabular data in a text file with the file extensions: .csv
and .tsv
respectively. It is also possible to store them with .txt
extension and the data wll work normally. As the names suggest, for CSV files there is a comma ,
between each value, whereas for TSV files there is a tab between each value.
Below is a sample .csv
text. You will see this being used more frequently in other chapters.
The simplicity of these formats allows researchers to easily exchange data among computers, a term known as portability.
These files can be opened with spreadsheet programs (such as Microsoft Excel). They can also be created from spreadsheet programs by going to File > Save As… and select “CSV (Comma delimited)” or “Text (Tab delimited)” as the file format.
3.1.3 FASTA
FASTA (pronounced “fast-A”) format is a simple type of format that bioinformaticians use to represent either DNA or protein sequences. FASTA files are therefore text files and stores nucleotide or amino acid sequences. It is written in text format, allowing for processing tools to easily parse the data. Within the basic FASTA file, there are two lines per sequence
- the identifier (comments, annotations)
- the sequence itself
The top line holds information pertaining to the sequence below. It is preceded by with a >
. Without this informative first line, we just have a raw format. As a general rule, it is recommended that each line of sequence be shorter than 80 characters. The file extensions for FASTA are .fa
or .fas
or .fasta
A FASTA file containing a single nucleotide sequence might look like this:
… and one containing a protein sequence may look like this:
Specific filename extensions
The generic form of FASTA file has the .fas
extension. For more specific types, we can use the following:
Extension | Meaning | Notes |
---|---|---|
fna | FASTA nucleic acid | Used generically to specify nucleic acids |
ffn | FASTA nucleotide coding regions | Contains coding regions for a genome |
faa | FASTA amino acid | Contains amino acid sequences. A multiple protein fasta file can have the more specific extension mpfa |
frn | FASTA non-coding RNA | Contains non-coding RNA regions for a genome, in DNA alphabet e.g. tRNA, rRNA |
3.1.4 FAST5
FAST5 is a binary file. More specifically, it is a Hierarchical Data Format (HDF5) file. used by Nanopore platforms to store the called sequences (in FASTQ format) as well as the raw electrical signal data from the pore. It’s file extension is .fast5
.
3.1.5 FASTQ
The FASTA format is extremely simple with just two lines per sequence — the first is for the description, the other for the raw sequence. As this format is nicely simple, it does not tell us the Quality of each nucleotide or protein. Your guess is as good as mine, yes! the Q in FASTQ (pronounced “fast-Q”) stands for the Quality of each nucleotide or amino acid in a sequence. FASTQ is therefore a text file, but often compressed with gzip and stores sequences and their quality scores. The quality score is encoded with a single ASCII character. The file extensions for FASTQ are .fq
or .fastq
(compressed as .fq.gz
or .fastq.gz
) Just like the FASTA, the first line in a FASTQ file contains the sequence identifier with an optional description, however, the line begins with an @
character rather than a >
character.
A FASTQ file containing a single sequence might look like any of these:
The second line contains the actual sequences of the read which is separated from the fourth (Quality) line by a +
on the third line.
The fourth line encodes the quality scores per each base call. This line must have the same length as the sequence in line 2
The byte representing quality runs from 0x21 (lowest quality; ‘!’ in ASCII) to 0x7e (highest quality; ‘~’ in ASCII). Here are the quality value characters in left-to-right increasing order of quality (ASCII):
3.1.6 SAM, BAM and CRAM
Before we talk about SAM, BAM and CRAM, we must discuss the software, SAMtools, from which these formats originate.
SAMtools
SAMtools is a suite of utilities that allow for efficient post-processing of short DNA sequence read alignments. The program includes several command line programs such as view, sort, and index that allow for next-generation sequence data processing. You will come across this in later chapters.
SAM format
The name SAM comes from Sequence Alignment/MAP. It is a text file with the file extensions .sam
. In addition to regular sequence reads, SAM includes alignment data that link short reads to a reference sequence. This makes SAM files the choice of format when visualizing short read sequences in genome browsers such as IGV (Integrated Genome Viewer).
BAM and CRAM
BAM refers to Binary Alignment Map. As the name suggest, its a binary file. It is same as a SAM file but compressed in binary form with the file extension .bam
. The SAM format is simple to parse, generate and check for errors. However, its large file size (usually in GB on average) gets in the way of efficiency. Thus, researchers found a way to compress it into a binary format without losing the ability to manipulate it. BAM contains indexable representation of nucleotide sequence alignments, allowing for intensive data processing in production pipelines.
CRAM is a restructured version of its binary version, with column-orientation.
What Information is in SAM & BAM
SAM files and BAM files contain the same information, but in a different format. Both SAM & BAM files contain an optional header section (beginning with the @
symbol) followed by the alignment section.
The header section may contain information about the entire file and additional information for alignments. The alignments then associate themselves with specific header information.
The alignment section contains the information for each sequence about where/how it aligns to the reference genome.
Alignment sections have 11 mandatory fields, as well as a variable number of optional fields
Col | Field | Type | Brief description |
---|---|---|---|
1 | QNAME | String | Query template NAME |
2 | FLAG | Int | bitwise FLAG |
3 | RNAME | String | References sequence NAME |
4 | POS | Int | 1- based leftmost mapping POSition |
5 | MAPQ | Int | MAPping Quality |
6 | CIGAR | String | CIGAR string |
7 | RNEXT | String | Ref. name of the mate/next read |
8 | PNEXT | Int | Position of the mate/next read |
9 | TLEN | Int | observed Template LENgth |
10 | SEQ | String | segment SEQuence |
11 | QUAL | String | ASCII of Phred-scaled base QUALity+33 |
The bitwise FLAG, provides the following information: * are there multiple fragments? * are all fragments properly aligned? * is this fragment unmapped? * is the next fragment unmapped? * is this query the reverse strand? * is the next fragment the reverse strand? * is this the 1st fragment? * is this the last fragment? * is this a secondary alignment? * did this read fail quality controls? * is this read a PCR or optical duplicate?
3.7 VCF
VCF stands for Variant Calling Format. As you may already guess from the name, yes!, it is a text file but often compressed with gzip that stores gene sequence variations (SNP/Indel variants). It’s file extension is .vcf
. It contains a header with metadata preceded by a ##
string. Best practices with VCF files recommend describing INFO, FILTER, and FORMAT entries used in the body within the header.
Following the header is the body, made up of 8 mandatory tab separated columns, one for each identifier and an unlimited number of optional columns that may be used to record other information about the sample(s). When additional columns are used, the first optional column is used to describe the format of the data in the columns that follow.
The columns of a VCF
Name | Brief description | |
---|---|---|
1 | CHROM | The name of the sequence (typically a chromosome) on which the variation is being called. This sequence is usually known as ‘the reference sequence’, i.e. the sequence against which the given sample varies. |
2 | POS | The 1-based position of the variation on the given sequence. |
3 | ID | The identifier of the variation, e.g. a dbSNP rs identifier, or if unknown a “.”. Multiple identifiers should be separated by semi-colons without white-space. |
4 | REF | The reference base (or bases in the case of an indel) at the given position on the given reference sequence. |
5 | ALT | The list of alternative alleles at this position. |
6 | QUAL | A quality score associated with the inference of the given alleles. |
7 | FILTER | A flag indicating which of a given set of filters the variation has failed or PASS if all the filters were passed successfully. |
8 | INFO | An extensible list of key-value pairs (fields) describing the variation. See below for some common fields. Multiple fields are separated by semicolons with optional values in the format: |
9 | FORMAT | An (optional) extensible list of fields for describing the samples. See below for some common fields. |
+ | SAMPLEs | For each (optional) sample described in the file, values are given for the fields listed in FORMAT. |
NB. You may have come across a BCF
file. Binary Call Format (BCF) is a binary representation of VCF, containing the same information in binary format for improved performance.
3.1.8 BED
BED stands for Browser Extensible Data. It’s a text file that stores coordinates of genomic regions and has the file extension .bed
. ìt has a tabs-delimited file format that allows users to define how data lines of an annotation track are displayed. One of the advantages of this format is the manipulation of coordinates instead of nucleotide sequences, which optimizes the power and computation time when comparing all or part of genomes.
BED files can have up to 12 columns, but only three are required for the UCSC browser, Galaxy browser and bedtools: chrom - Name of chromosome - chr5, chrX, chr2_random. or scaffold - scaffold10671 chromStart - Starting position of chrom. - First base starts at 0. chromEnd - Ending position. - This value does not get displayed. For example, the first 20 bases would have chromStart value of 0 to and chromEnd value of 20.
3.1.9 GFF
GFF stands for General Feature Format (also called gene-finding format, generic feature format). It’s a text file that stores gene coordinates and other features and has the file extension .gff
. It is used to describe genes and other features of DNA, RNA and protein sequences.
GFF is an extension of a basic file with the name, start and end parameters (NSE). For example, an NSE (Chromosome2,2000,4000) specifies two kilobases found on chromosome 2. GFF allows the annotation of these segments and it consists of one line per feature, each containing 9 columns of data. Each column is separated by a tab, making it a tabs-delimited file. It uses a header region with a ##
string to include metadata.
General GFF3 structure
Position index | Position name | Description |
---|---|---|
1 | seqid | The name of the sequence where the feature is located. |
2 | source | Keyword identifying the source of the feature, like a program (e.g. Augustus or RepeatMasker) or an organization (like TAIR). |
3 | type | The feature type name, like “gene” or “exon”. In a well structured GFF file, all the children features always follow their parents in a single block (so all exons of a transcript are put after their parent “transcript” feature line and before any other parent transcript line). In GFF3, all features and their relationships should be compatible with the standards released by the Sequence Ontology Project. |
4 | start | Genomic start of the feature, with a 1-base offset. This is in contrast with other 0-offset half-open sequence formats, like BED. |
5 | end | Genomic end of the feature, with a 1-base offset. This is the same end coordinate as it is in 0-offset half-open sequence formats, like BED.[citation needed] |
6 | score | Numeric value that generally indicates the confidence of the source in the annotated feature. A value of “.” (a dot) is used to define a null value. |
7 | strand | Single character that indicates the strand of the feature; it can assume the values of “+” (positive, or 5’->3’), “-”, (negative, or 3’->5’), “.” (undetermined). |
8 | phase | phase of CDS features; it can be either one of 0, 1, 2 (for CDS features) or “.” (for everything else). See the section below for a detailed explanation. |
9 | attributes | All the other information pertaining to this feature. The format, structure and content of this field is the one which varies the most between the three competing file formats. |
You may have come across a GTF
file. The GTF (Gene Transfer Format) file type shares the same format as GFF files, though it is used to define gene and transcript-related features exclusively.
3.1.10 NEWICK
The NEWICK format is a text file usually with the file extensions: .tree
or .treefile
. It is a way of representing graph-theoretical trees with edge lengths using parentheses and commas. It stores phylogenetic trees including nodes names and edge lengths.
The following tree:
could be represented in Newick format in several ways
(,,(,)); no nodes are named
(A,B,(C,D)); leaf nodes are named
(A,B,(C,D)E)F; all nodes are named
(:0.1,:0.2,(:0.3,:0.4):0.5); all but root node have a distance to parent
(:0.1,:0.2,(:0.3,:0.4):0.5):0.0; all have a distance to parent
(A:0.1,B:0.2,(C:0.3,D:0.4):0.5); distances and leaf names (popular)
(A:0.1,B:0.2,(C:0.3,D:0.4)E:0.5)F; distances and all names
((B:0.2,(C:0.3,D:0.4)E:0.5)F:0.1)A; a tree rooted on a leaf node (rare)
Below is the newick tree data of the first tree you will ever build in this course
Myfirsttree.tree
(TBNmA399:0.008270245,TBNmA464:0.007078265,((TBNmA217:0.009068723,((TBNmA361:0.008676792,TBNmA250:0.007952614)0.882:0.000644627,((ERX467883:0.019205883,(((TBNmA286:0.008210194,(TBNmA162:0.000672847,(TBNmA432:0.000681896,TBNmA437:0.000906203)0.520:0.000000005)1.000:0.007987255)0.963:0.009745124,ERX4639416:0.031087096)0.760:0.002127560,(ERX4639417:0.014406021,TBNmA120:0.017392319)0.983:0.003844014)0.561:0.000000005)0.733:0.000446004,(((((ERX2675537:0.037328220,(ERX2822389:0.035236770,((NC_002945.4:0.085788912,(TBNmA403:0.003585076,TBNmA265:0.005594965)1.000:0.077815900)1.000:0.045871008,(ERX4639418:0.021865797,(((TBNmA389:0.011858464,(TBNmA343:0.010603912,TBNmA187:0.010539288)0.887:0.001097047)1.000:0.025977064,((TBNmA041:0.023722379,((TBNmA271:0.001127595,((TBNmA452:0.008680633,TBNmA077:0.009375017)1.000:0.005680396,(TBNmA123:0.010406082,(TBNmA085:0.011381506,TBNmA064:0.009800991)0.946:0.001327826)0.996:0.003589739)0.988:0.009195987)0.881:0.002561916,((((TBNmA327:0.000000005,TBNmA345:0.000224666)1.000:0.012329485,(TBNmA321:0.000975869,TBNmA322:0.001858213)1.000:0.009131999)1.000:0.008206124,((TBNmA382:0.007331542,(TBNmA365:0.005435424,(TBNmA306:0.006093690,TBNmA117:0.007030120)0.543:0.000000005)0.771:0.000000005)0.998:0.002534799,(ERX1275306:0.007758780,(TBNmA050:0.004680429,(TBNmA191:0.006984201,(TBNmA412:0.007273051,TBNmA258:0.009516111)0.823:0.000000005)0.909:0.000672574)0.880:0.000492079)0.976:0.001602252)1.000:0.005254102)1.000:0.008769376,(TBNmA121:0.013273022,(TBNmA280:0.005420607,(TBNmA190:0.005355244,TBNmA428:0.005210953)0.870:0.000000005)0.972:0.002255197)1.000:0.017861767)0.707:0.000136299)0.884:0.000659873)1.000:0.004134566,(TBNmA257:0.002984847,(TBNmA377:0.000000005,(TBNmA362:0.000000005,TBNmA357:0.000000005)0.881:0.000000005)0.944:0.001325432)1.000:0.025946471)0.976:0.000000005)1.000:0.017127738,(ERX512130:0.009550754,(TBNmA394:0.001180640,TBNmA331:0.001151751)1.000:0.006179602)1.000:0.034250718)0.163:0.000000005)0.189:0.000110233)0.987:0.002076210)0.823:0.000755449)1.000:0.009140266,ERX467856:0.036035736)0.667:0.000000005,((TBNmA110:0.017454586,((ERX1749324:0.012028354,TBNmA262:0.011318248)0.985:0.004246403,(TBNmA348:0.012534351,TBNmA067:0.011426344)1.000:0.005264503)0.816:0.000000005)1.000:0.016810353,(TBNmA393:0.032119007,(TBNmA074:0.012340989,(ERX467886:0.000000005,ERX467895:0.000000005)1.000:0.012723309)1.000:0.018036487)0.707:0.001395736)0.999:0.004644168)0.872:0.000000005,(TBNmA395:0.025760081,((TBNmA351:0.011266183,TBNmA266:0.008431616)1.000:0.011565390,(TBNmA383:0.014217025,(TBNmA447:0.003756605,(TBNmA305:0.003438150,TBNmA435:0.004132593)0.993:0.002834680)1.000:0.006652517)1.000:0.009294668)0.858:0.000767446)1.000:0.008991216)0.708:0.000000005,TBNmA384:0.029259232)1.000:0.014945411)1.000:0.014745950)0.681:0.000000005)0.009:0.000000005,((TBNmA124:0.004508430,(TBNmA460:0.008309784,(TBNmA125:0.003377088,TBNmA195:0.005663590)0.953:0.001124832)0.751:0.000219064)0.897:0.000448976,TBNmA290:0.006384894)0.388:0.000000005)0.624:0.000000005);
3.1.11 Credit
Information on this page has been adapted and modified from the following source(s):
Credit: https://github.com/cambiotraining/sars-cov-2-genomics
https://snipcademy.com/sequence-file-formats
https://en.wikipedia.org/wiki
Wellcome Genome Campus Advanced Courses and Scientific Conferences 2017 - WORKING WITH PATHOGEN GENOMES Course Manual http://www.wellcome.ac.uk/advancedcourses