How can I inspect the content of and manipulate text files?
How can I find and replace text patterns within files?
What is an escape character?
What are wildcards and how do I use them?
Learning Objectives:
Inspect the content of text files (head, tail, cat, zcat, less).
Use the * wildcard to work with multiple files at once.
Redirect the output of a command to a file (>, >>).
Find a pattern in a text file (grep) and do basic pattern replacement (sed).
Demonstrate the use of sed and its “substitute” command s in processing text.
Demonstrate the use of the escape character \.
Demonstrate the use of the wildcards ., * and ? in manipulating text using sed.
Key Points:
The head and tail commands can be used to look at the top or bottom of a file, respectively.
The less command can be used to interactively investigate the content of a file. Use ↑ and ↓ to browse the file and Q to quit and return to the console.
The cat command can be used to combine multiple files together. The zcat command can be used instead if the files are compressed.
The > operator redirects the output of a command into a file. If the file already exists, its content will be overwritten.
The >> operator also redirects the output of a command into a file, but appends it to any content that already exists.
The grep command can be used to find the lines in a text file that match a text pattern.
The sed tool can be used for advanced text manipulation. The “substitute” command can be used for text replacement: sed 's/pattern/replacement/options'.
The escape character \ invokes an alternative interpretation of the character that follows it, usually one of *, ., [, ], ?, $, ^, / or \.
The escape character \ can also be used to provide visual representations of non-printing characters and characters that usually have special meanings, e.g. \n (a newline), \r (a carriage return) and \t (a horizontal tab).
The wildcards ., * and ? are very useful characters for manipulating text; they function as placeholders.
2.3.1 Looking Inside Files
Often we want to investigate the content of a file, without having to open it in a text editor. This is especially useful if the file is very large (as is often the case in bioinformatic applications).
For example, let’s take a look at the file TBNmA041_annotation_truncated_1.gff3. We will start by printing the whole content of the file with the cat command, which stands for “concatenate” (we will see why it’s called this way in a little while):
Navigate to the ~/Desktop/workshop_files_Bact_Genomics_2023/02_unix_intro directory and run the following command:
cat TBNmA041_annotation_truncated_1.gff3
This outputs a lot of text, because the file is quite long! Take a moment and scroll through from beginning to end. In fact, this file (as you may have realized from its name), is a truncated file, and you can imagine having to scroll through the entire length of the full file. You will later find out what file format this file is and better explore its contents.
Instead of looking at the entire content of a file, it is often more useful to look only at the top few lines of the file. We can do this with the head command:
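As a minimal sketch, the commands below show head together with its counterpart tail. To stay self-contained they run on a small stand-in file (demo/numbers.txt, a hypothetical name) rather than the workshop data; with the course files you would pass TBNmA041_annotation_truncated_1.gff3 instead. The -n option sets how many lines are shown (the default is 10).

```shell
# Create a small stand-in file with 20 numbered lines (hypothetical demo data)
mkdir -p demo
seq 1 20 > demo/numbers.txt

# Show the first 10 lines (the default)
head demo/numbers.txt

# Show only the first 3 lines
head -n 3 demo/numbers.txt

# Show only the last 3 lines
tail -n 3 demo/numbers.txt
```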
Finally, if we want to open the file and browse through it, we can use the less command:
less TBNmA041_annotation_truncated_1.gff3
less will open the file and you can use ↑ and ↓ to move line-by-line or the Page Up and Page Down keys to move page-by-page. You can also press the Space Bar to move forward one screen at a time. You can exit less by pressing Q (for “quit”). This will bring you back to the console.
It can also be useful to count how many lines, words and characters a file has. We can use the wc command for this:
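A minimal sketch of wc combined with the * wildcard, using two tiny stand-in files (hypothetical names) so that it runs anywhere. With the workshop data the equivalent command would be wc TBNmA041_annotation_truncated*.

```shell
# Create two small stand-in files (hypothetical demo data)
mkdir -p demo
printf 'one two\nthree\n' > demo/sample_1.txt
printf 'four five six\n' > demo/sample_2.txt

# Report lines, words and characters (bytes) for every file matching
# the pattern, followed by a "total" row
wc demo/sample_*
```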
In this case, we used the * wildcard to count lines, words and characters (in that order, left-to-right) of both truncated gff3 files. Often, we only want to count one of these things, and wc has options for all of them:
-l counts lines only.
-w counts words only.
-c counts characters only.
For example, if we just want to know how many lines we have in each file:
Count only lines in a file
wc -l TBNmA041_annotation_truncated*
617 TBNmA041_annotation_truncated_1.gff3
722 TBNmA041_annotation_truncated_2.gff3
1339 total
We will go into more details on the wc command in our next lesson.
Exercise 2.3.1.1: Looking inside files with less
Use the less command to look inside the file MTB_H37Rv_truncated.fasta.
How many lines does this file contain?
Use the less command again but with the option -S. Can you understand what this option does?
Solution:
We can investigate the content of the reference file using less MTB_H37Rv_truncated.fasta. From this view, it looks like this file contains several lines of content: the truncated genome is more than 150kb long, so it’s not surprising we see so much text! We can use Q to quit and go back to the console.
To check the number of lines in the file, we can use the wc -l MTB_H37Rv_truncated.fasta command. The answer is only 2.
If we use less -S MTB_H37Rv_truncated.fasta the display is different this time. We see only two lines in the output. If we use the → and ← arrows we can see that the text now goes “out of the screen”. So, what happens is that by default less will “wrap” long lines, so if a line of text is too long, it will continue it on the next line of the screen. When we use the option -S it instead displays each line individually, and we can use the arrow keys to see the content that does not fit on the screen.
Note
The annotation_truncated files we just looked into are in a format called GFF. This is a standard bioinformatic file format that stores gene coordinates and other features and has the file extension .gff (or .gff3 for version 3 of the format, as in our files). It is used to describe genes and other features of DNA, RNA and protein sequences. In this case, it corresponds to the coordinates of each annotated gene (start and end position) in the Mycobacterium tuberculosis reference genome (MTB_H37Rv). It also uses a header region with a ## string to include metadata.
In the exercise we looked at another standard file format called FASTA. This one is used to store nucleotide or amino acid sequences. In this case, the truncated nucleotide sequence of the Mycobacterium tuberculosis reference genome (MTB_H37Rv).
We will learn more about these files under File formats.
2.3.2 Combining several files
We said that the cat command we used above stands for “concatenate”. This is because this command can be used to concatenate (combine) several files together. For example, if we wanted to combine both annotation_truncated .gff3 files into a single file:
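With the workshop files this would presumably be cat TBNmA041_annotation_truncated_1.gff3 TBNmA041_annotation_truncated_2.gff3. The sketch below shows the same idea on two small stand-in files (hypothetical names) so that it can be run anywhere.

```shell
# Create two small stand-in files (hypothetical demo data)
mkdir -p demo
printf 'line A1\nline A2\n' > demo/part_1.txt
printf 'line B1\n' > demo/part_2.txt

# Print the content of both files, one after the other
cat demo/part_1.txt demo/part_2.txt
```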
Running the above command combines (concatenates) both files and prints the result to the screen. What we usually want, however, is to redirect the output to a file.
2.3.3 Redirecting Output
The cat command we just used printed the output to the screen. But what if we wanted to save it into a file? We can achieve this by sending (or redirecting) the output of the command to a file using the > operator.
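As a minimal sketch of redirection, again on stand-in files with hypothetical names (with the workshop data you would list the two .gff3 files and choose your own output name):

```shell
# Create two small stand-in files (hypothetical demo data)
mkdir -p demo
printf 'line A1\nline A2\n' > demo/part_1.txt
printf 'line B1\n' > demo/part_2.txt

# Send the combined output to a new file instead of the screen
cat demo/part_1.txt demo/part_2.txt > demo/combined.txt

# Confirm that the new file exists
ls demo
```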
Now, the output is not printed to the console, but instead sent to a new file. We can check that the file was created with ls.
If we use > and the output file already exists, its content will be replaced. If what we want to do is append the result of the command to the existing file, we should use >> instead. Let’s see this in practice in the next exercise.
Exercise 2.3.3.1: Adding data to an existing file
List the files in the sequencing_run1/ directory. Save the output in a file called “sequencing_files.txt”.
After performing task 1 above, what happens if you run the command ls sequencing_run2/ > sequencing_files.txt?
The operator >> can be used to append the output of a command to an existing file. Try re-running both of the previous commands, but instead using the >> operator. What happens now?
Solution:
Task 1
To list the files in the directory we use ls, followed by > to save the output in a file:
If we start again from the beginning, but instead use the >> operator the second time we run the command, we will append the output to the file instead of replacing it:
ls sequencing_run1/ > sequencing_files.txt
ls sequencing_run2/ >> sequencing_files.txt
cat sequencing_files.txt
This section is only a brief introduction to the grep command. See the bonus lesson for a detailed description, and for how to use find.
Sometimes it can be very useful to find lines of a file that match a particular text pattern. We can use the tool grep (“global regular expression print”) to achieve this. For example, let’s find the word >contig in one of our annotation_truncated files:
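A minimal sketch of this search, run on a small stand-in file (hypothetical name and content) rather than the workshop annotation file. Note that the pattern is quoted: an unquoted > would be interpreted by the shell as output redirection.

```shell
# Create a small stand-in file (hypothetical demo data)
mkdir -p demo
printf '>contig_1\nACGT\n>contig_2\nTTGA\n' > demo/sample.txt

# Print every line that contains the text >contig
grep ">contig" demo/sample.txt
```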
We can see the result is all the lines that matched this word pattern.
Exercise 2.3.4.1: Finding patterns
Consider our two annotation_truncated files – TBNmA041_annotation_truncated_1.gff3 and TBNmA041_annotation_truncated_2.gff3.
Create a new file called TBNmA041_contigs.txt that contains only the lines of text with the word “>contig” from both .gff files.
Hint
You can use grep to find a pattern in a file. You can use > to redirect the output of a command to a new file, and you can use >> to add onto an existing file.
Create a second file called TBNmA041_CDS.txt that contains only the lines of text with the acronym “CDS”. CDS stands for CoDing Sequence.
Now count the number of contigs and CDS in the combined files. Are they different? What did you expect? Assume each CDS represents a specific gene.
Solution:
Task 1
We can use grep to find the pattern in our first .gff3 file and use > to save the output in a new file, then repeat for the second file using >> to append to it:
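A minimal sketch of this approach, using stand-in files with hypothetical names. With the workshop data the inputs would be the two TBNmA041_annotation_truncated .gff3 files and the output TBNmA041_contigs.txt.

```shell
# Create two small stand-in files (hypothetical demo data)
mkdir -p demo
printf '>contig_1\nACGT\n' > demo/annot_1.txt
printf '>contig_2\nTTGA\n' > demo/annot_2.txt

# First file: > creates (or overwrites) the output file
grep ">contig" demo/annot_1.txt > demo/contigs.txt

# Second file: >> appends to the existing output file
grep ">contig" demo/annot_2.txt >> demo/contigs.txt

# Check the result
cat demo/contigs.txt
```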
We could investigate the output of our command using less TBNmA041_contigs.txt.
Task 2
We will follow the same code as used in Task 1 above, by first looking for the acronym CDS in the first .gff3 file and output its result to TBNmA041_CDS.txt.
We could investigate the output of our command using less TBNmA041_CDS.txt.
Task 3
We can use wc to count the lines of the newly created files:
wc -l TBNmA041_CDS.txt TBNmA041_contigs.txt
84 TBNmA041_CDS.txt
85 TBNmA041_contigs.txt
169 total
This is hypothetical data. Note, however, that a given contig can have several CDS and some contigs may have none, so we would not expect the numbers to be the same in a real dataset.
2.3.5 Text Replacement: sed - stream editor
One of the most prominent text-processing utilities on GNU/Linux is the sed command, which is short for “stream editor”. A stream editor is used to perform basic text transformations on an input stream (a file or input from a pipeline).
Note
In this tutorial, we’ll use the GNU version of sed (available on Ubuntu and other Linux operating systems). macOS ships with the BSD version of sed, which has different options and arguments. You can install the GNU version of sed with Homebrew using brew install gnu-sed.
Basic Usage
There are many instances when we want to substitute text in a line or filter out specific lines. In such cases, we can take advantage of sed. sed operates on a stream of text which it gets either from a text file or from standard input (STDIN). This means you can use the output of another command as the input of sed; in short, you can combine sed with other commands.
By default, sed outputs everything to standard output (STDOUT). It means, unless redirected, sed will print its output onto the terminal/screen instead of saving it in a file.
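The two points above can be sketched as follows, using a small stand-in file (demo/greeting.txt, a hypothetical name). The s command used here is explained in detail further down.

```shell
# Create a small stand-in file (hypothetical demo data)
mkdir -p demo
echo "Hello world" > demo/greeting.txt

# sed reading from a file: the result is printed to the screen only
sed 's/world/participant/' demo/greeting.txt

# sed reading from standard input, fed by another command through a pipe
echo "Hello world" | sed 's/world/participant/'

# The original file is unchanged
cat demo/greeting.txt

# To keep the result, redirect it to a new file
sed 's/world/participant/' demo/greeting.txt > demo/greeting_new.txt
```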
Note
sed edits line-by-line and in a non-interactive way.
The general form of the substitute command is sed 's/pattern/replacement/options', where s is the sed command, pattern is the text we want to replace (written as a regular expression; more on this later) and replacement is the new text we want to use instead.
There are also other “options” added at the end of the command, which change the default behaviour of the text substitution. Some of the common options are:
g: by default sed will only substitute the first match of the pattern. If we use the g option (“global”), then sed will substitute all matching text.
i: by default sed matches the pattern in a case-sensitive manner. For example ‘A’ (Uppercase A) and ‘a’ (Lowercase A) are treated as different. If we use the i option (“case-insensitive”) then sed will treat ‘A’ and ‘a’ as the same.
For example, let’s create a file with some text inside it:
echo "Hello world. How are you world?" > hello.txt
Note
The echo command prints text to the console. Here we redirect that text into a file to use in our example, which is a quick way of creating a small text file.
If we do:
sed 's/world/participant/' hello.txt
This is the result
Hello participant. How are you world?
We can see that the first “world” word was replaced with “participant”. This is the default behaviour of sed: only the first pattern it finds in a line of text is replaced with the new word. We can modify this by using the g option after the last /:
NB. sed only printed the modified text to the screen; hello.txt itself was not changed. The first command below simply re-creates the file so that we start from a known state.
echo "Hello world. How are you world?" > hello.txt
sed 's/world/participant/g' hello.txt
Hello participant. How are you participant?
More examples of the s command
Create a new file ‘input.txt’ and write the following text in it:
Hello, this is a test line. This is a very short line.
This is test line two. In this line, we have two occurrences of the test.
This line has many occurrences of the Test with different cases. test tEst TesT.
Now try to replace ‘test’ with ‘hello’. We can do something like this:
sed 's/test/hello/' input.txt
Hello, this is a hello line. This is a very short line.
This is hello line two. In this line, we have two occurrences of the test.
This line has many occurrences of the Test with different cases. hello tEst TesT.
You may have noticed that lines two and three still contain ‘test’. This is because we asked sed to replace only the first match on each line. To replace all the matches, we have to use the g option. Let’s try it:
sed 's/test/hello/g' input.txt
Hello, this is a hello line. This is a very short line.
This is hello line two. In this line, we have two occurrences of the hello.
This line has many occurrences of the Test with different cases. hello tEst TesT.
Something is still wrong with the third line: ‘Test’, ‘tEst’ and ‘TesT’ have not been replaced. We have to tell sed that we want to replace all of them regardless of case, which we can do with the i option. Let’s add one more option:
sed 's/test/hello/gi' input.txt
Hello, this is a hello line. This is a very short line.
This is hello line two. In this line, we have two occurrences of the hello.
This line has many occurrences of the hello with different cases. hello hello hello.
Wonderful!
Now suppose we want to replace the occurrences of ‘test’ only on line 3, or only on lines 2 to 3. sed lets you put an address (a line number or a range of lines) in front of the command to restrict where it is applied. Here is an example:
sed '3s/test/hello/gi' input.txt
Hello, this is a test line. This is a very short line.
This is test line two. In this line, we have two occurrences of the test.
This line has many occurrences of the hello with different cases. hello hello hello.
Only the third line was edited by sed; the first two lines are unchanged. The number placed before the s command is the line we want to edit. We can also give a range of lines. Here is one more example:
sed '2,3s/test/hello/gi' input.txt
As you may have guessed, this edits lines 2 and 3. The output will be:
Hello, this is a test line. This is a very short line.
This is hello line two. In this line, we have two occurrences of the hello.
This line has many occurrences of the hello with different cases. hello hello hello.
Note
Regular Expressions
Finding patterns in text can be a very powerful skill to master. In our examples we have been finding a literal word and replacing it with another word. However, we can do more complex text substitutions by using special keywords that define a more general pattern. These are known as regular expressions.
For example, in regular expression syntax, the character . stands for “any character”. So, for example, the pattern H. would match a “H” followed by any character, and the expression:
sed 's/H./X/g' hello.txt
Results in:
Xllo world. Xw are you world?
Notice how both “He” (at the start of the word “Hello”) and “Ho” (at the start of the word “How”) are replaced with the letter “X”, because both of them match the pattern “H followed by any character” (H.).
You may have asked yourself, if the “forward slash” / character is used to separate parts of the sed substitute command, then how would we replace the “/” character itself in a piece of text? For example, let’s add a new line of text to our file:
echo "Welcome to this workshop/course." >> hello.txt
Let’s say we wanted to replace “workshop/course” with “tutorial” in this text. If we did:
sed 's/workshop/course/tutorial/' hello.txt
We would get an error:
sed: -e expression #1, char 5: unknown option to `s'
This is because we ended up with too many / characters in the command, and sed uses / to separate the different parts of the command. In this situation we need to tell sed to treat that / as the literal “/” character rather than as a special character. To do this, we put a “backslash” \, which is called the “escape” character, before the /. That tells sed to treat the / as a normal character rather than a separator.
So:
sed 's/workshop\/course/tutorial/' hello.txt
                ↑ This / is "escaped" with \ beforehand
This looks a little strange, but the main thing to remember is that \/ will be interpreted as the character “/” rather than the separator of sed’s substitute command.
Escape character
An escape character is a character that invokes an alternative interpretation of the character that follows it. It is also used to insert characters that would otherwise not be allowed in a string. In sed the escape character is the backslash \, and together with the character (or characters) after it, it forms an escape sequence. Some of the characters you may need to escape are:
*: Asterisk.
.: Dot.
[: Left square bracket.
]: Right square bracket.
?: Question mark.
$: Dollar sign.
^: Caret
/: Forward slash
\: Backward slash
Special use of the escape character
Escape characters are also used to provide visual representations of non-printing characters and characters that usually have special meanings. Commonly used examples in sed are \n (a newline), \r (a carriage return) and \t (a horizontal tab).
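As a minimal sketch, assuming GNU sed (BSD sed on macOS may not interpret \t this way), the command below uses \t in the replacement to turn the first space on a line into a tab. The input text is hypothetical.

```shell
# Replace the first space on the line with a tab character
# (GNU sed interprets \t in the replacement as a horizontal tab)
printf 'gene1 100\n' | sed 's/ /\t/'
```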
The file in bacteria_rpob/bacteria_truncated_rpob.fasta contains the nucleotide sequences of bacterial RNA polymerase beta subunit genes (rpoB) for 4 bacteria samples.
We will cover FASTA files in a subsequent section, for now all we need to know is that each sequence has a name in a line that starts with the character >.
Use sed to achieve the following:
Substitute the word trunc with truncated.
Hint
Similar to how you can use the g option to do “global” substitution, you can also use the i option to do case-insensitive text substitution.
Substitute the word /incomplete with -missing.
The character . is also a keyword used in regular expressions to mean “any character”. See what happens if you run the command sed 's/./X/g' bacteria_rpob/bacteria_truncated_rpob.fasta. How would you fix this command to literally only substitute the character . with X?
Solution:
Task 1
To replace the word trunc with truncated, we can do:
Everything becomes “X”! That’s because . is a keyword used in regular expressions to mean “any character”. Because we are using the g option (for “global substitution”), we replaced every single character with “X”. To literally replace the character “.”, we need to again use the \ escape character, so:
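A minimal sketch of the difference, shown on a stand-in line of text (hypothetical content) rather than the workshop FASTA file. With the course data the corrected command would be sed 's/\./X/g' bacteria_rpob/bacteria_truncated_rpob.fasta.

```shell
# Unescaped . matches any character, so every character is replaced
echo "sample.1 v2.0" | sed 's/./X/g'

# Escaped \. matches only a literal dot
echo "sample.1 v2.0" | sed 's/\./X/g'
```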
We have already covered wildcards briefly on the “Navigating Files and Directories” page; here, we will discuss the use of ., * and ? in sed, where they are part of regular expression syntax and behave differently from the shell wildcards used for file names.
You can use . as a placeholder for any single character except newline (\n) or empty text. For example, a regular expression like x.z will match strings like xaz, xbz, x1z, xzz, etc., but it will not match xz.
Also recall:
You can use * to match 0 or more occurrences of the previous character. For example, xy*z will match strings like xz (0 occurrences of y), xyz (1 occurrence of y), xyyz and so on.
? is similar to *, except that it matches only 0 or 1 occurrence of the previous character. For example, xy?z will match strings like xz (0 occurrences of y) and xyz (1 occurrence of y), but not xyyz. Note that in GNU sed ? only has this meaning with extended regular expressions (sed -E); in the default basic syntax it is written \?.
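The three wildcards can be sketched on the example strings used above. The file name demo/wildcards.txt and the replacement text MATCH are hypothetical choices for illustration; the -E option is used for ? so that it is read as a wildcard.

```shell
# Create a stand-in file with one test string per line (hypothetical demo data)
mkdir -p demo
printf 'xz\nxaz\nxyz\nxyyz\n' > demo/wildcards.txt

# . matches exactly one character: changes xaz and xyz
sed 's/x.z/MATCH/' demo/wildcards.txt

# * matches zero or more of the previous character: changes xz, xyz and xyyz
sed 's/xy*z/MATCH/' demo/wildcards.txt

# ? matches zero or one of the previous character: changes xz and xyz only
# (-E switches on extended regular expressions, where ? has this meaning)
sed -E 's/xy?z/MATCH/' demo/wildcards.txt
```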
Now, let’s do some coding. Create a file input2.txt and copy the following sequence:
We have to use the i flag to tell sed to match the word bacteria in a case-insensitive manner.
Second, we have to replace all the . characters with X. Remember that . is a keyword (it has a special meaning in sed), so to get a literal . we have to escape it with \. We can replace all the . characters as follows:
Now we have to combine all three steps, which we can do using the pipe |, and then redirect the output to a file rather than the default output. We will talk more about the pipe | in the next session.
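A minimal sketch of chaining sed commands with a pipe and saving the result, assuming GNU sed. Only the two steps described above are shown; the input text, the file names and the replacement word microbe are hypothetical, since the original sequence is not reproduced here.

```shell
# Create a stand-in input file (hypothetical demo data)
mkdir -p demo
echo "Bacteria sample.1 BACTERIA sample.2" > demo/piped_input.txt

# First sed: case-insensitive, global replacement of "bacteria"
# Second sed: replace every literal dot with X
# The pipe | passes the output of the first sed to the second,
# and > saves the final result to a file instead of printing it
sed 's/bacteria/microbe/gi' demo/piped_input.txt | sed 's/\./X/g' > demo/piped_output.txt

# Check the result
cat demo/piped_output.txt
```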