2.3 Text Manipulation

Teaching: 20 min || Exercises: 10 min

Overview

Questions:
  • How can I inspect the content of and manipulate text files?
  • How can I find and replace text patterns within files?
  • What is an escape character?
  • What are wildcards and how do I use them?
Learning Objectives:
  • Inspect the content of text files (head, tail, cat, zcat, less).
  • Use the * wildcard to work with multiple files at once.
  • Redirect the output of a command to a file (>, >>).
  • Find a pattern in a text file (grep) and do basic pattern replacement (sed).
  • Demonstrate the usage of the sed and the “substitute” command s in processing text.
  • Demonstrate on the use of the escape character \
  • Demonstrate on the use of wildcards ” ., * and ?” in manipulating text using sed.
Key Points:
  • The head and tail commands can be used to look at the top or bottom of a file, respectively.
  • The less command can be used to interactively investigate the content of a file. Use and to browse the file and Q to quit and return to the console.
  • The cat command can be used to combine multiple files together. The zcat command can be used instead if the files are compressed.
  • The > operator redirects the output of a command into a file. If the file already exists, it’s content will be overwritten.
  • The >> operator also redirects the output of a command into a file, but appends it to any content that already exists.
  • The grep command can be used to find the lines in a text file that match a text pattern.
  • The sed tool can be used for advanced text manipulation. The “substitute” command can be used to text replacement: sed 's/pattern/replacement/options'.
  • The escape character \ invokes an alternative interpretation of the following/next character usually *, ., [, ], ?, $, ^, /, \.
  • The escape character \ can also be used to provide visual representations of non-printing characters and characters that usually have special meanings. Eg. \n: a newline., \r: a carriage return., \t: a horizontal tab.
  • Wildcards ” ., * and ?” are very useful characters used for manipulating text and they basically function as placeholders.

2.3.1 Looking Inside Files

Often we want to investigate the content of a file, without having to open it in a text editor. This is especially useful if the file is very large (as is often the case in bioinformatic applications).

For example, let’s take a look at the file TBNmA041_annotation_truncated_1.gff3. We will start by printing the whole content of the file with the cat command, which stands for “concatenate” (we will see why it’s called this way in a little while):

Navigate to the ~/Desktop/workshop_files_Bact_Genomics_2023/02_unix_intro directory and run the following command

cat TBNmA041_annotation_truncated_1.gff3

This outputs a lot of text, because the file is quite long! Take a moment and scroll through from beginning to end. In fact, this file (as you may have realized from its name), is a truncated file, and you can imagine having to scroll through the entire length of the full file. You will later find out what file format this file is and better explore its contents.

Instead of looking at the entire content of a file, it is often more useful to look only at the top few lines of the file. We can do this with the head command:

head TBNmA041_annotation_truncated_1.gff3
##gff-version 3
##feature-ontology https://github.com/The-Sequence-Ontology/SO-Ontologies/blob/v3.1/so.obo
# Annotated with Bakta
# Software: v1.5.0
# Database: v4.0
# DOI: 10.1099/mgen.0.000685
# URL: github.com/oschwengers/bakta
##sequence-region contig_1 1 321788
contig_1    Bakta   region  1   321788  .   +   .   ID=contig_1;Name=contig_1
contig_1    Prodigal    CDS 15  272 .   +   0   ID=BDKKDL_00005;Name=hypothetical protein;...

By default, head prints the first 10 lines of the file. We can change this using the -n option, followed by a number, for example:

Print out first 3 lines
head -n 3 TBNmA041_annotation_truncated_1.gff3
##gff-version 3
##feature-ontology https://github.com/The-Sequence-Ontology/SO-Ontologies/blob/v3.1/so.obo
# Annotated with Bakta

Similarly, we can look at the bottom few lines of a file with the tail command:

Print out last 3 lines
tail -n 3 TBNmA041_annotation_truncated_1.gff3
>contig_148
GAGTTACGTCCAGGGGTGTGGTGTACGGGCAGGTAAGGCCGGTGGGCGTGTCGTAGCCCA
GTAGTGGGCGGTCATCGCGTGATCCTTCGAAACGACCAGCAAAAGTCAATCG

Finally, if we want to open the file and browse through it, we can use the less command:

less TBNmA041_annotation_truncated_1.gff3

less will open the file and you can use and to move line-by-line or the Page Up and Page Down keys to move page-by-page. You can also use the Space Bar to navigate in bits. You can exit less by pressing Q (for “quit”). This will bring you back to the console.

Finally, it can sometimes be useful to count how many lines, words and characters a file has. We can use the wc command for this:

wc TBNmA041_annotation_truncated*
     617    1519   37336 TBNmA041_annotation_truncated_1.gff3
     722    1740   53321 TBNmA041_annotation_truncated_2.gff3
    1339    3259   90657 total

In this case, we used the * wildcard to count lines, words and characters (in that order, left-to-right) of both truncated gff3 files. Often, we only want to count one of these things, and wc has options for all of them:

  • -l counts lines only.
  • -w counts words only.
  • -c counts characters only.

For example, if we just want to know how many lines we have in each files:

Count only lines in a file
wc -l TBNmA041_annotation_truncated*
     617 TBNmA041_annotation_truncated_1.gff3
     722 TBNmA041_annotation_truncated_2.gff3
    1339 total

We will go into more details on the wc command in our next lesson.

Exercise 2.3.1.1: Looking inside files with less
  • Use the less command to look inside the file MTB_H37Rv_truncated.fasta.
  • How many lines does this file contain?
  • Use the less command again but with the option -S. Can you understand what this option does?

We can investigate the content of the reference file using less MTB_H37Rv_truncated.fasta. From this view, it looks like this file contains several lines of content: the truncated genome is more than 150kb long, so it’s not surprising we see so much text! We can use Q to quit and go back to the console.

To check the number of lines in the file, we can use the wc -l MTB_H37Rv_truncated.fasta command. The answer is only 2.

If we use less -S MTB_H37Rv_truncated.fasta the display is different this time. We see only two lines in the output. If we use the and arrows we can see that the text now goes “out of the screen”. So, what happens is that by default less will “wrap” long lines, so if a line of text is too long, it will continue it on the next line of the screen. When we use the option -S it instead displays each line individually, and we can use the arrow keys to see the content that does not fit on the screen.

Note

The annotation_truncated files we just looked into are in a format called GFF. This is a standard bioinformatic file format that stores gene coordinates and other features and has the file extension .gff. It is used to describe genes and other features of DNA, RNA and protein sequences. In this case, it corresponds to the coordinates of each annotated gene (start and end position) in the Mycobacterium tuberculosis reference genome (MTB_H37Rv). It also uses a header region with a ## string to include metadata.

In the exercise we looked at another standard file format called FASTA. This one is used to store nucleotide or amino acid sequences. In this case, the truncated nucleotide sequence of the Mycobacterium tuberculosis reference genome (MTB_H37Rv).

We will learn more about these files under File formats.

2.3.2 Combining several files

We said that the cat command we used above stands for “concatenate”. This is because this command can be used to concatenate (combine) several files together. For example, if we wanted to combine both sets of annotation_truncated .gff files into a single file:

cat TBNmA041_annotation_truncated_1.gff3 TBNmA041_annotation_truncated_2.gff3

Running the above command actually combines (concatenates) both files and print out the output. But what we will really want to do is to redirect the output to a different file.

2.3.3 Redirecting Output

The cat command we just used printed the output to the screen. But what if we wanted to save it into a file? We can achieve this by sending (or redirecting) the output of the command to a file using the > operator.

cat TBNmA041_annotation_truncated_1.gff3 TBNmA041_annotation_truncated_2.gff3 > TBNmA041_annotation_combined.gff3

Now, the output is not printed to the console, but instead sent to a new file. We can check that the file was created with ls.

If we use > and the output file already exists, its content will be replaced. If what we want to do is append the result of the command to the existing file, we should use >> instead. Let’s see this in practice in the next exercise.

Exercise 2.3.3.1: Adding data to an existing file
  1. List the files in the sequencing_run1/ directory. Save the output in a file called “sequencing_files.txt”.
  2. After performing task 1 above, what happens if you run the command ls sequencing_run2/ > sequencing_files.txt?
  3. The operator >> can be used to append the output of a command to an existing file. Try re-running both of the previous commands, but instead using the >> operator. What happens now?

Task 1

To list the files in the directory we use ls, followed by > to save the output in a file:

ls sequencing_run1/ > sequencing_files.txt

We can check the content of the file:

cat sequencing_files.txt
sample1_run1.fastq
sample2_run1.fastq
sample3_run1.fastq
sample4_run1.fastq
sample5_run1.fastq

Task 2

If we run ls sequencing_run2/ > sequencing_files.txt, we will replace the content of the file:

cat sequencing_files.txt
sample1_run2.fastq
sample2_run2.fastq
sample3_run2.fastq
sample4_run2.fastq
sample5_run2.fastq

Task 3

If we start again from the beginning, but instead use the >> operator the second time we run the command, we will append the output to the file instead of replacing it:

ls sequencing_run1/ > sequencing_files.txt
ls sequencing_run2/ >> sequencing_files.txt
cat sequencing_files.txt
sample1_run1.fastq
sample2_run1.fastq
sample3_run1.fastq
sample4_run1.fastq
sample5_run1.fastq
sample1_run2.fastq
sample2_run2.fastq
sample3_run2.fastq
sample4_run2.fastq
sample5_run2.fastq

2.3.4 Finding Patterns

This is just to serve as an introduction to the grep command. Check out detailed description in the bonus lesson and also on how to use find

Sometimes it can be very useful to find lines of a file that match a particular text pattern. We can use the tool grep (“global regular expression print”) to achieve this. For example, let’s find the word >contig in one of our annotation_truncated files:

grep ">contig" TBNmA041_annotation_truncated_1.gff3
>contig_1
>contig_2
>contig_3
>contig_4
>contig_5
>contig_100
>contig_101
>contig_102
>contig_103
...

We can see the result is all the lines that matched this word pattern.

Exercise 2.3.4.1: Finding patterns

Consider our two annotation_truncated files – TBNmA041_annotation_truncated_1.gff3 and TBNmA041_annotation_truncated_2.gff3.

  1. Create a new file called TBNmA041_contigs.txt that contains only the lines of text with the word “>contig” from both .gff files.

    Hint

    You can use grep to find a pattern in a file. You can use > to redirect the output of a command to a new file, and you cann use >> to add onto an existing file.

  2. Create a second file called TBNmA041_CDS.txt that contains only the lines of text with the acronym “CDS”. CDS stands for CoDing Sequence.

  3. Now count the number of contigs and CDS in the combined files. Are they different? What did you expect? assume each CDS represents a specific gene

Task 1

We can use grep to find the pattern in our first .gff files and use > to save the output in a new file:

grep ">contig" TBNmA041_annotation_truncated_1.gff3 > TBNmA041_contigs.txt

Next, we use >> to add the output of our second run to the file we created above TBNmA041_contigs.txt.

grep ">contig" TBNmA041_annotation_truncated_2.gff3 >> TBNmA041_contigs.txt

We could investigate the output of our command using less TBNmA041_contigs.txt.


Task 2

We will follow the same code as used in Task 1 above, by first looking for the acronym CDS in the first .gff3 file and output its result to TBNmA041_CDS.txt.

grep "CDS" TBNmA041_annotation_truncated_1.gff3 > TBNmA041_CDS.txt

Next, we use >> to add the output of our second run to the file we created above TBNmA041_contigs.txt.

grep "CDS" TBNmA041_annotation_truncated_2.gff3 >> TBNmA041_CDS.txt

We could investigate the output of our command using less TBNmA041_CDS.txt.


Task 3

We can use wc to count the lines of the newly created files:

wc -l TBNmA041_CDS.txt TBNmA041_contigs.txt
      84 TBNmA041_CDS.txt
      85 TBNmA041_contigs.txt
     169 total

This is a hypothetical data, however, note that, a given contig can have several CDS and some contigs may have no CDS. Consequently, we don’t expect the numbers to be the same in a real situation.

2.3.5 Text Replacement: sed - stream editor

One of the most prominent text-processing utilities on GNU/Linux is the sed command, which is short for “stream editor”. A stream editor is used to perform basic text transformations on an input stream (a file or input from a pipeline).

Note

In this tutorial, we’ll use the GNU version of sed (available on Ubuntu and other Linux operating systems). The macOS has the BSD version of sed which has different options and arguments. You can install the GNU version of sed with Homebrew using brew install gnu-sed.

Basic Usage

There are many instances when we want to substitute a text in a line or filter out specific lines. In such cases, we can take advantage of sed. sed operates on a stream of text which it gets either from a text file or from standard input (STDIN). It means you can use the output of another command as the input of sed – in short, you can combine sed with other commands.

By default, sed outputs everything to standard output (STDOUT). It means, unless redirected, sed will print its output onto the terminal/screen instead of saving it in a file.

Note

sed edits line-by-line and in a non-interactive way.

The basic usage of sed:

sed [OPTIONS] SCRIPT INPUT-FILES

sed contains several sub-commands, but the main one we will use is the substitute or s command. To see a list of all sed’s commands, visit https://www.gnu.org/software/sed/manual/html_node/sed-commands-list.html.

The basic syntax for the s command is:

sed 's/pattern/replacement/options'

Where s is called the sed command, pattern is the word we want to substitute also known as a regular expression (more on this later) and replacement is the new word we want to use instead.

So the basic concept of the ‘s’ command is: it attempts to match the pattern in a line, if the match is successful, then that portion of the pattern is replaced with replacement. To learn more about the ‘s’ command visit https://www.gnu.org/software/sed/manual/html_node/The-_0022s_0022-Command.html.

There are also other “options” added at the end of the command, which change the default behaviour of the text substitution. Some of the common options are:

  • g: by default sed will only substitute the first match of the pattern. If we use the g option (“global”), then sed will substitute all matching text.
  • i: by default sed matches the pattern in a case-sensitive manner. For example ‘A’ (Uppercase A) and ‘a’ (Lowercase A) are treated as different. If we use the i option (“case-insensitive”) then sed will treat ‘A’ and ‘a’ as the same.

For example, let’s create a file with some text inside it:

echo "Hello world. How are you world?" > hello.txt
Note

The echo command is used to print some text on the console. In this case we are sending that text to a file to use in our example. Another pretty way of easily creating a text file.

If we do:

sed 's/world/participant/' hello.txt

This is the result

Hello participant. How are you world?

We can see that the first “world” word was replaced with “participant”. This is the default behaviour of sed: only the first pattern it finds in a line of text is replaced with the new word. We can modify this by using the g option after the last /:

NB. first rewrite the hello.txt file to see the full effect as we have already changed the first world.

echo "Hello world. How are you world?" > hello.txt
sed 's/world/participant/g' hello.txt
Hello participant. How are you participant?

Create a new file ‘input.txt’ and write the following text in it:

Hello, this is a test line. This is a very short line.
This is test line two. In this line, we have two occurrences of the test.
This line has many occurrences of the Test with different cases. test tEst TesT.

Now try to replace ‘test’ with ‘hello’. We can do something like this:

sed 's/test/hello/' input.txt
Hello, this is a hello line. This is a very short line.
This is hello line two. In this line, we have two occurrences of the test.
This line has many occurrences of the Test with different case. hello tEst TesT.

You may have noticed that lines two and three still have ‘test’. This is because we ask the sed to only replace the first text which matches. To replace all the matches, we have to use g option. Let’s try with g option.

sed 's/test/hello/g' input.txt
Hello, this is a hello line. This is a very short line.
This is hello line two. In this line, we have two occurrences of the hello.
This line has many occurrences of the Test with different case. hello tEst TesT.

Ah, something is still wrong with the third line. It is not replacing ‘Test’, ‘tEst’ and ‘TesT’. We have to do something to tell the sed that we want to replace all of them. We can do this by using i option. Let’s add one more option:

sed 's/test/hello/gi' input.txt
Hello, this is a hello line. This is a very short line.
This is hello line two. In this line, we have two occurrences of the hello.
This line has many occurrences of the hello with different case. hello hello hello.

Wonderful!

Let’s say now we only want to replace all the occurrences of the ‘test’ at only line 3 or from line 2 to 3. Try to remember the basic syntax for the sed command. Remember! You can add the address of the line or a range of lines that you want to edit. Here is an example:

sed '3s/test/hello/gi' input.txt
Hello, this is a test line. This is a very short line.
This is test line two. In this line, we have two occurrences of the test.
This line has many occurrences of the hello with different case. hello hello hello.

See only the third line is executed by the sed. The first two lines are as it is. At the beginning of the ‘s’ command, we are adding the line number which we want to edit. We can also add a range of lines. Here is one more example:

sed '2,3s/test/hello/gi' input.txt

As you may have got the idea, it will edit lines 2 and 3. The output will be:

Hello, this is a test line. This is a very short line.
This is hello line two. In this line, we have two occurrences of the hello.
This line has many occurrences of the hello with different case. hello hello hello.
Note

Regular Expressions

Finding patterns in text can be a very powerful skill to master. In our examples we have been finding a literal word and replacing it with another word. However, we can do more complex text substitutions by using special keywords that define a more general pattern. These are known as regular expressions.

For example, in regular expression syntax, the character . stands for “any character”. So, for example, the pattern H. would match a “H” followed by any character, and the expression:

sed 's/H./X/g' hello.txt

Results in:

Xllo world. Xw are you world?

Notice how both “He” (at the start of the word “Hello”) and “Ho” (at the start of the word “How”) are replaced with the letter “X”. Because both of them match the pattern “H followed by any character” (H.).

To learn more see this Regular Expression Cheatsheet.

The \ Escape Character

You may have asked yourself, if the “forward slash” / character is used to separate parts of the sed substitute command, then how would we replace the “/” character itself in a piece of text? For example, let’s add a new line of text to our file:

echo "Welcome to this workshop/course." >> hello.txt

Let’s say we wanted to replace “workshop/course” with “tutorial” in this text. If we did:

sed 's/workshop/course/tutorial/' hello.txt

We would get an error:

sed: -e expression #1, char 5: unknown option to `s'

This is because we ended up with too many / in the command, and sed uses that to separate its different parts of the command. In this situation we need to tell sed to ignore that / as being a special character but instead treat it as the literal “/” character. To do this, we need to use “backslash” \ before /, which is called the “escape” character. That will tell sed to treat the / as a normal character rather than a separator of its commands.

So:

sed 's/workshop\/course/tutorial/' hello.txt
                
           This / is "escaped" with \ beforehand

This looks a little strange, but the main thing to remember is that \/ will be interpreted as the character “/” rather than the separator of sed’s substitute command.

Escape character

An escape character is a character that invokes an alternative interpretation of the following character. Sometimes it is also used to insert unallowed characters in a string. An escape character is a backslash \ followed by a character (or characters). Some of the keywords/characters which you want to escape are as follows:

  • *: Asterisk.
  • .: Dot.
  • [: Left square bracket.
  • ]: Right square bracket.
  • ?: Question mark.
  • $: Dollar sign.
  • ^: Caret
  • /: Forward slash
  • \: Backward slash
Special use of the escape character

Escape characters are also used to provide visual representations of non-printing characters and characters that usually have special meanings. The list of commonly used escape characters in the sed is as follows:

  • \n: a newline.
  • \r: a carriage return.
  • \t: a horizontal tab.

For more details visit https://www.gnu.org/software/sed/manual/sed.html#Escapes.

Exercise 2.3.5.1: Use of the escape character \

The file in bacteria_rpob/bacteria_truncated_rpob.fasta contains the nucleotide sequences of bacteria RNA polymerase beta subunit genes (rpob) for 4 bacteria samples.

Let’s have a lool at the file
cat bacteria_rpob/bacteria_truncated_rpob.fasta
>Mycobacterium_tuberculosis_H37Rv_trunc_rpob_gene
TTGGCAGATTCCCGCCAGAGCAAAACAGCCGCTAGTCCTAGTCCGAGTCGCCCGCAAAGTTCCTCGAATAACTC
>Salmonella_enterica_truncated/incomplete_rpob_gene
GAGGAACCCTATGGTTTACTCCTATACCGAGAAAAAACGTATTCGTAAGGATTTTGGTA...............
>Staphylococcus_aureus_subsp._aureus_Trunc_rpob_gene
TTGGCAGGTCAAGTTGTCCAATATGGAAGACATCGTAAACGTAGAAACTACGCGAGAATTTCAGAAGTATTAAC
>Streptococcus_pneumoniae_TRUNC_rpob_gene
TTGGCAGGACATGACGTTCAATACGGGAAACATCGTACCCGTCGTAGTTTTTCAAGAATCAAAGAAGTTCAAAT

We will cover FASTA files in a subsequent section, for now all we need to know is that each sequence has a name in a line that starts with the character >.

Use sed to achieve the following:

  1. Substitute the word trunc with truncated.
    Hint Similar to how you can use the g option to do “global” substitution, you can also use the i option to do case-insensitive text substitution.
  2. Substitute the word /incomplete with -missing.
  3. The character . is also a keyword used in regular expressions to mean “any character”. See what happens if you run the command sed 's/./X/g' bacteria_rpob/bacteria_truncated_rpob.fasta. How would you fix this command to literally only substitute the character . with X?

Task 1

To replace the word trunc with truncated, we can do:
sed 's/trunc/truncated/i' bacteria_rpob/bacteria_truncated_rpob.fasta
>Mycobacterium_tuberculosis_H37Rv_truncated_rpob_gene
TTGGCAGATTCCCGCCAGAGCAAAACAGCCGCTAGTCCTAGTCCGAGTCGCCCGCAAAGTTCCTCGAATAACTC
>Salmonella_enterica_truncatedated/incomplete_rpob_gene
GAGGAACCCTATGGTTTACTCCTATACCGAGAAAAAACGTATTCGTAAGGATTTTGGTA...............
>Staphylococcus_aureus_subsp._aureus_truncated_rpob_gene
TTGGCAGGTCAAGTTGTCCAATATGGAAGACATCGTAAACGTAGAAACTACGCGAGAATTTCAGAAGTATTAAC
>Streptococcus_pneumoniae_truncated_rpob_gene
TTGGCAGGACATGACGTTCAATACGGGAAACATCGTACCCGTCGTAGTTTTTCAAGAATCAAAGAAGTTCAAAT

We have to use the option i to tell the sed that we want to match the word trunc in case-insensitive manner.


Task 2

For the second task, if we do:

sed 's//incomplete/-missing/'  bacteria_rpob/bacteria_truncated_rpob.fasta

We will get an error. We need to use \ to escape the / keyword in our pattern, so:

Use of the \ escape character - I
sed 's/\/incomplete/-missing/'  bacteria_rpob/bacteria_truncated_rpob.fasta
>Mycobacterium_tuberculosis_H37Rv_trunc_rpob_gene
TTGGCAGATTCCCGCCAGAGCAAAACAGCCGCTAGTCCTAGTCCGAGTCGCCCGCAAAGTTCCTCGAATAACTC
>Salmonella_enterica_truncated-missing_rpob_gene
GAGGAACCCTATGGTTTACTCCTATACCGAGAAAAAACGTATTCGTAAGGATTTTGGTA...............
>Staphylococcus_aureus_subsp._aureus_Trunc_rpob_gene
TTGGCAGGTCAAGTTGTCCAATATGGAAGACATCGTAAACGTAGAAACTACGCGAGAATTTCAGAAGTATTAAC
>Streptococcus_pneumoniae_TRUNC_rpob_gene
TTGGCAGGACATGACGTTCAATACGGGAAACATCGTACCCGTCGTAGTTTTTCAAGAATCAAAGAAGTTCAAAT

Task 3

Finally, to replace the character ., if we do:

Use of the \ escape character - II
sed 's/./X/g' bacteria_rpob/bacteria_truncated_rpob.fasta
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

Everything becomes “X”! That’s because . is a keyword used in regular expressions to mean “any character”. Because we are using the g option (for “global substitution”), we replaced every single character with “X”. To literally replace the character “.”, we need to again use the \ escape character, so:

sed 's/\./X/g' bacteria_rpob/bacteria_truncated_rpob.fasta
>Mycobacterium_tuberculosis_H37Rv_trunc_rpob_gene
TTGGCAGATTCCCGCCAGAGCAAAACAGCCGCTAGTCCTAGTCCGAGTCGCCCGCAAAGTTCCTCGAATAACTC
>Salmonella_enterica_truncated/incomplete_rpob_gene
GAGGAACCCTATGGTTTACTCCTATACCGAGAAAAAACGTATTCGTAAGGATTTTGGTAXXXXXXXXXXXXXXX
>Staphylococcus_aureus_subspX_aureus_Trunc_rpob_gene
TTGGCAGGTCAAGTTGTCCAATATGGAAGACATCGTAAACGTAGAAACTACGCGAGAATTTCAGAAGTATTAAC
>Streptococcus_pneumoniae_TRUNC_rpob_gene
TTGGCAGGACATGACGTTCAATACGGGAAACATCGTACCCGTCGTAGTTTTTCAAGAATCAAAGAAGTTCAAAT

Wildcards in sed

We have already covered a bit of wildcards on “Navigating Files and Directories” page; here, we will discuss the use of ., * and ? in the sed. These wildcards are part of the regular expression.

  • You can use . as a placeholder for any character except newline (\n) or empty text. For example, if you use . in your regular expression like x.z then it will match to strings like xaz, xbz, x1z, xzz, etc., but, it will not match to xz.

Also recall:

  • You can use * to match 0 or more occurrences of the previous character. For example, xy*z will match to strings like xz (0 occurrences of y), xyz (1 occurrence of y), xyyz and so on.

  • ? is a bit similar to *. The difference is it will only match for 0 or 1 occurrence of the previous character. For example, xy?z will match to strings like xz (0 occurrences of y), xyz (1 occurrence of y) but not to xyyz.

Now, let’s do some coding. Create a file input2.txt and copy the following sequence:

ATGCCTGATTGGGCTACGTCGTAAGCGATGGCTAGGTATCGTAAAGGGGTTTGGGAACCCCAATCACTAGCT
Hint

You can do this simply by using echo:

echo 
"ATGCCTGATTGGGCTACGTCGTAAGCGATGGCTAGGTATCGTAAAGGGGTTTGGGAACCCCAATCACTAGCT" 
> input2.txt 

Let’s say we want to replace anything between A and G with U. We can do this using the sed:

sed 's/A.G/AUG/g' input2.txt
AUGCCTGATTGGGCTAUGTCGTAUGCGAUGGCTAUGTATCGTAAUGGGGTTTGGGAACCCCAATCACTAGCT
Exercise 2.3.5.2: Replace Me

Navigate to the bacteria_rpob/ direcrory and have a look at the rpob_gene.fasta file. You have four tasks to do on this file. They are:

  • Replace the word bacteria with sample. Make sure to replace all words matching to bacteria in a case insensitive manner.
  • Replace all the . with X.
  • Remove /incomplete from one of the sample names.
  • Save the output to rpob_gene_processed.fasta.

Let’s first navigate to the bacteria_rpob/ directory.

cd bacteria_rpob/

We will now walk through the solution step by step. At first, to replace the word bacteria with sample, we can do something like this:

sed 's/bacteria/sample/i' rpob_gene.fasta
>sample_1_rpob_gene
CCGAGTCGCCCGCAAAGTTCCTCGAATAACTCTTGGCAGATTCCCGCCAGAGCAAAACAGCCGCTAGTCCTAGT
>sample_2_rpob_gene
AATATGGAAGACATCGTAAACGTAGAAACTACGCTTGGCAGGTCAAGTTGTCCGAGAATTTCAGAAGTATTAAC
>sample_3_rpob_gene
TTGGCAGGACATGACGTTCAATACGGGAAACATCGT..............TTCAAGAATCAAAGAAGTTCAAAT
>sample_4/incomplete_rpob_gene
GAGGAACCCTATGGTTTACTCCTATACCGAGAAAAAACGTATTCGTAAGGATTTTGGTA...............

We have to use the flag i to tell the sed that we want to match the word bacteria in case insensitive manner.

Second we have to replace all the . with X. Remember the . is a keyword (or has special meaning in the sed). So, to have the literal meaning of ., we have to escape the . with \. We can replace all the . as follows:

sed 's/\./X/g' rpob_gene.fasta 
>bacteria_1_rpob_gene
CCGAGTCGCCCGCAAAGTTCCTCGAATAACTCTTGGCAGATTCCCGCCAGAGCAAAACAGCCGCTAGTCCTAGT
>Bacteria_2_rpob_gene
AATATGGAAGACATCGTAAACGTAGAAACTACGCTTGGCAGGTCAAGTTGTCCGAGAATTTCAGAAGTATTAAC
>BACTERIA_3_rpob_gene
TTGGCAGGACATGACGTTCAATACGGGAAACATCGTXXXXXXXXXXXXXXTTCAAGAATCAAAGAAGTTCAAAT
>Bacteria_4/incomplete_rpob_gene
GAGGAACCCTATGGTTTACTCCTATACCGAGAAAAAACGTATTCGTAAGGATTTTGGTAXXXXXXXXXXXXXXX

In lines 3 and 4, you can see that all the . are replaced by X. Note that to replace all the . we also have to use g flag.

Third, we have to remove \incomplete. Again / is a keyword in the sed. We have to do something similar to the previous step.

sed 's/\/incomplete//' rpob_gene.fasta
>bacteria_1_rpob_gene
CCGAGTCGCCCGCAAAGTTCCTCGAATAACTCTTGGCAGATTCCCGCCAGAGCAAAACAGCCGCTAGTCCTAGT
>Bacteria_2_rpob_gene
AATATGGAAGACATCGTAAACGTAGAAACTACGCTTGGCAGGTCAAGTTGTCCGAGAATTTCAGAAGTATTAAC
>BACTERIA_3_rpob_gene
TTGGCAGGACATGACGTTCAATACGGGAAACATCGT..............TTCAAGAATCAAAGAAGTTCAAAT
>Bacteria_4_rpob_gene
GAGGAACCCTATGGTTTACTCCTATACCGAGAAAAAACGTATTCGTAAGGATTTTGGTA...............

Now, we have to combine all three steps (this we can do by using pipe|) and then redirect the output to a file rather than the default output. We will talk more about pipe | in the next session.

sed 's/bacteria/sample/i' rpob_gene.fasta | sed 's/\./X/g' | 
sed 's/\/incomplete//' > rpob_gene_processed.fasta

You will see no output in the terminal. But you will see a new file rpob_gene_processed.fasta, which will look similar to the following:

>sample_1_rpob_gene
CCGAGTCGCCCGCAAAGTTCCTCGAATAACTCTTGGCAGATTCCCGCCAGAGCAAAACAGCCGCTAGTCCTAGT
>sample_2_rpob_gene
AATATGGAAGACATCGTAAACGTAGAAACTACGCTTGGCAGGTCAAGTTGTCCGAGAATTTCAGAAGTATTAAC
>sample_3_rpob_gene
TTGGCAGGACATGACGTTCAATACGGGAAACATCGTXXXXXXXXXXXXXXTTCAAGAATCAAAGAAGTTCAAAT
>sample_4_rpob_gene
GAGGAACCCTATGGTTTACTCCTATACCGAGAAAAAACGTATTCGTAAGGATTTTGGTAXXXXXXXXXXXXXXX

To learn more about the use of sed visit https://www.gnu.org/software/sed/manual/sed.html

2.3.6 Credit

Information on this page has been adapted and modified from the following source:

  • https://github.com/cambiotraining/sars-cov-2-genomics