2.4 Combining Commands

Teaching: 20 min || Exercises: 10 min

Overview

Questions:
  • How can I combine existing commands to do new things?
Learning Objectives:
  • Redirect a command’s output to a file.
  • Process a file instead of keyboard input using redirection.
  • Construct command pipelines with two or more stages.
  • Explain what usually happens if a program or pipeline isn’t given any input to process.
Key Points:
  • Several other useful commands manipulate text and are often used together in pipes:
    • cat displays the contents of its inputs.
    • head displays the first 10 lines of its input.
    • tail displays the last 10 lines of its input.
    • sort orders the input lines alphabetically (default) or numerically (with the -n option).
    • wc counts lines, words, and characters in its inputs.
    • cut extracts parts of each line that are separated by a delimiter.
    • uniq removes adjacent duplicate lines from its input. The -c option can also be used to count how often each unique line occurs.
  • command > file redirects a command’s output to a file (overwriting any existing content).
  • command >> file appends a command’s output to a file.
  • command < file redirects the contents of a file to a command’s input.
  • The | pipe lets us chain several commands together.
  • first | second is a pipeline: the output of the first command is used as the input to the second.
  • The best way to use the shell is to use pipes to combine simple single-purpose programs (filters).

2.4.1 A few more basic commands

Let’s revisit some basic commands we have already encountered, this time going a little deeper, and introduce a few new commands along the way.

Performing counts with wc

We’ll start with a directory called molecules that contains six files describing some simple organic molecules. The .pdb extension indicates that these files are in Protein Data Bank format, a simple text format that specifies the type and position of each atom in the molecule.

ls molecules
cubane.pdb    ethane.pdb    methane.pdb
octane.pdb    pentane.pdb   propane.pdb

Let’s go into that directory with cd and run the command wc *.pdb. wc is the “word count” command: it counts the number of lines, words, and characters in files (from left to right, in that order).

The * in *.pdb matches zero or more characters, so the shell turns *.pdb into a list of all .pdb files in the current directory:

cd molecules
wc *.pdb
  20  156 1158 cubane.pdb
  12   84  622 ethane.pdb
   9   57  422 methane.pdb
  30  246 1828 octane.pdb
  21  165 1226 pentane.pdb
  15  111  825 propane.pdb
 107  819 6081 total
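
If you ever want to see what a wildcard expands to, remember that echo simply prints its arguments: the shell expands *.pdb before echo runs, so (from inside the molecules directory) echo *.pdb shows exactly the file list that wc received:

echo *.pdb
cubane.pdb ethane.pdb methane.pdb octane.pdb pentane.pdb propane.pdb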

If we run wc -l instead of just wc, the output shows only the number of lines per file:

wc -l *.pdb
  20  cubane.pdb
  12  ethane.pdb
   9  methane.pdb
  30  octane.pdb
  21  pentane.pdb
  15  propane.pdb
 107  total
Why Isn’t It Doing Anything?

What happens if a command is supposed to process a file, but we don’t give it a filename? For example, what if we type:

wc -l

but don’t type *.pdb (or anything else) after the command? Since it doesn’t have any filenames, wc assumes it is supposed to process input given at the command prompt, so it just sits there and waits for us to give it some data interactively. From the outside, though, all we see is it sitting there: the command doesn’t appear to do anything.

If you make this kind of mistake, you can escape from this state by holding down the control key (Ctrl) and pressing the letter C once, then letting go of the Ctrl key (i.e. Ctrl+C).

We can also use -w to get only the number of words, or -c to get only the number of characters.
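
For example, using methane.pdb, whose counts we already know from the table above (9 lines, 57 words, 422 characters):

wc -w methane.pdb
 57 methane.pdb
wc -c methane.pdb
422 methane.pdb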

Which of these files contains the fewest lines? It’s an easy question to answer when there are only six files, but what if there were 6000? Our first step toward a solution is to run the command:

wc -l *.pdb > lengths.txt

The greater than symbol, >, tells the shell to redirect the command’s output to a file instead of printing it to the screen. (This is why there is no screen output: everything that wc would have printed has gone into the file lengths.txt instead.) The shell will create the file if it doesn’t exist. If the file exists, it will be silently overwritten, which may lead to data loss and thus requires some caution. ls lengths.txt confirms that the file exists:

ls lengths.txt
lengths.txt

Printing file content to screen with cat

We have already encountered the cat command in our previous lesson. We will go a little deeper here and see how useful it is when combining commands.

We can now send the content of lengths.txt to the screen using cat lengths.txt. The cat command gets its name from “concatenate” i.e. join together, and it prints the contents of files one after another. There’s only one file in this case, so cat just shows us what it contains:

cat lengths.txt
  20  cubane.pdb
  12  ethane.pdb
   9  methane.pdb
  30  octane.pdb
  21  pentane.pdb
  15  propane.pdb
 107  total
Output Page by Page

We’ll continue to use cat in this lesson, for convenience and consistency, but it has the disadvantage that it always dumps the whole file onto your screen. More useful in practice is the command less, which you use with less lengths.txt. This displays a screenful of the file, and then stops. You can go forward one screenful by pressing the spacebar, or back one by pressing b. Press q to quit.

Sorting with the sort command

Now let’s use the sort command to sort the contents of lengths.txt.

What Does sort -n Do?

If we run sort on a file containing the following lines:

10
2
19
22
6

the output is:

10
19
2
22
6

If we run sort -n on the same input, we get this instead:

2
6
10
19
22
Note

The -n option specifies a numerical rather than an alphanumerical sort.

We will use this -n option now to sort the contents of lengths.txt numerically. This does not change the file; instead, it sends the sorted result to the screen:

sort -n lengths.txt
  9  methane.pdb
 12  ethane.pdb
 15  propane.pdb
 20  cubane.pdb
 21  pentane.pdb
 30  octane.pdb
107  total

We can put the sorted list of lines in another temporary file called sorted-lengths.txt by putting > sorted-lengths.txt after the command, just as we used > lengths.txt to put the output of wc into lengths.txt. Once we’ve done that, we can run another command called head to get the first few lines in sorted-lengths.txt:

sort -n lengths.txt > sorted-lengths.txt
head -n 1 sorted-lengths.txt
  9  methane.pdb

Using -n 1 with head tells it that we only want the first line of the file; -n 20 would get the first 20, and so on. Since sorted-lengths.txt contains the lengths of our files ordered from least to greatest, the output of head must be the file with the fewest lines.
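
For instance, asking head for three lines gives us the three shortest files, in order:

head -n 3 sorted-lengths.txt
  9  methane.pdb
 12  ethane.pdb
 15  propane.pdb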

Redirecting to the same file

It’s a very bad idea to try redirecting the output of a command that operates on a file to the same file. For example:

sort -n lengths.txt > lengths.txt

Doing something like this may give you incorrect results and/or delete the contents of lengths.txt.
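
If you do need to sort a file “in place”, sort offers a safe alternative: the -o option. Because sort reads all of its input before it opens the output file, the following is safe even though the input and output files are the same:

sort -n -o lengths.txt lengths.txt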

What Does >> Mean?

We have seen the use of >, but there is a similar operator, >>, which works slightly differently. We’ll learn about the difference between these two operators by printing some strings. We can use the echo command to print strings, e.g.

echo The echo command prints text
The echo command prints text

Now test the commands below to reveal the difference between the two operators:

(Try executing each command twice in a row and then examining the output files)

echo hello > testfile01.txt

and:

echo hello >> testfile02.txt
Note

In the first example with >, the string “hello” is written to testfile01.txt, but the file gets overwritten each time we run the command.

We see from the second example that the >> operator also writes “hello” to a file (in this case testfile02.txt), but appends the string to the file if it already exists (i.e. when we run it for the second time).
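
You can confirm this by inspecting both files after running each command twice:

cat testfile01.txt
hello
cat testfile02.txt
hello
hello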

Exercise 2.4.1.1: Appending Data

We have already met the head command, which prints lines from the start of a file. tail is similar, but prints lines from the end of a file instead.

Consider the file bacteria.txt. After these commands, select the answer that corresponds to the file bacteria-subset.txt:

head -n 3 bacteria.txt > bacteria-subset.txt
tail -n 2 bacteria.txt >> bacteria-subset.txt
  1. The first three lines of bacteria.txt
  2. The last two lines of bacteria.txt
  3. The first three lines and the last two lines of bacteria.txt
  4. The second and third lines of bacteria.txt

Option 3 is correct.

For option 1 to be correct we would only run the head command.

For option 2 to be correct we would only run the tail command.

For option 4 to be correct we would have to pipe the output of head into tail -n 2 by doing head -n 3 bacteria.txt | tail -n 2 > bacteria-subset.txt

If you think this is confusing, that’s fine: even once you understand what wc, sort, and head do, all those intermediate files make it hard to follow what’s going on.

Now that we know a few basic commands, we can finally look at the shell’s most powerful feature: the ease with which it lets us combine existing programs in new ways.

2.4.2 The | Pipe

The way to combine commands together is with a pipe, which uses the special operator |.

Returning to our previous exercise, we can skip the intermediate file entirely by running sort and head together:

sort -n lengths.txt | head -n 1
  9  methane.pdb

The vertical bar, |, between the two commands is called a pipe. It tells the shell that we want to use the output of the command on the left as the input to the command on the right.

Nothing prevents us from chaining several pipes together. For example, we can send the output of wc directly to sort, and then the resulting output to head. We first use a pipe to send the output of wc to sort:

wc -l *.pdb | sort -n
   9 methane.pdb
  12 ethane.pdb
  15 propane.pdb
  20 cubane.pdb
  21 pentane.pdb
  30 octane.pdb
 107 total

And now we send the output of this pipe, through another pipe, to head, so that the full pipeline becomes:

wc -l *.pdb | sort -n | head -n 1
   9  methane.pdb

This is exactly like a mathematician nesting functions like log(3x) and saying “the log of three times x”. In our case, the calculation is “head of sort of line count of *.pdb”.

The redirection and pipes used in the last few commands are illustrated below:

[Figure: Redirects and Pipes]

Exercise 2.4.2.1: Piping Commands Together

In our current directory, we want to find the 3 files which have the least number of lines. Which command listed below would work?

  1. wc -l * > sort -n > head -n 3
  2. wc -l * | sort -n | head -n 1-3
  3. wc -l * | head -n 3 | sort -n
  4. wc -l * | sort -n | head -n 3

Option 4 is the solution. The pipe character | is used to connect the output from one command to the input of another. > is used to redirect standard output to a file. Try it in the molecules/ directory!

Exercise 2.4.2.2: An example pipeline: Checking Files

There are 17 files from an assay in the ~/Desktop/workshop_files_Bact_Genomics_2023/02_unix_intro/north-pacific-gyre/2012-07-03 directory. Suppose you want to do some quick sanity checks on the content of the files. You know that the files are supposed to have 300 lines each. Start by moving to that directory.

  1. How would you check if there are any files with fewer than 300 lines in the directory?
  2. How would you check if there are any files with more than 300 lines in the directory?

Answers:

  1. wc -l *.txt | sort -n | head -n 5. Report the number of lines of all the text files in the directory, sort them numerically, and then look at the top of the list (if any files have fewer than 300 lines, they will appear here).
  2. wc -l *.txt | sort -n -r | head -n 5. Same as above, but now we sort the files in reverse order so the longest ones appear at the top.

This idea of linking programs together is why Unix has been so successful. Instead of creating enormous programs that try to do many different things, Unix programmers focus on creating lots of simple tools that each do one job well, and that work well with each other. This programming model is called “pipes and filters”.

We’ve already seen pipes; a filter is a program like wc or sort that transforms a stream of input into a stream of output. Almost all of the standard Unix tools can work this way: unless told to do otherwise, they read from standard input, do something with what they’ve read, and write to standard output.

The key is that any program that reads lines of text from standard input and writes lines of text to standard output can be combined with every other program that behaves this way as well. You can and should write your programs this way so that you and other people can put those programs into pipes to multiply their power.
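
As a minimal sketch of what such a filter might look like, here is a small (hypothetical) shell script, longest-line.sh, that reads lines from standard input and writes a single result to standard output, and therefore drops straight into a pipeline like any other filter:

#!/bin/bash
# Read each line from standard input and track the longest line seen so far
max=0
while IFS= read -r line; do
    if [ "${#line}" -gt "$max" ]; then
        max=${#line}
    fi
done
# Write the length of the longest line to standard output
echo "$max"

It could then be combined with the other tools we have seen, e.g. cat *.pdb | bash longest-line.sh.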

Exercise 2.4.2.3: Pipe Reading Comprehension

A file called bacteria.txt (in the 02_unix_intro folder) contains the following data:

2023-01-24,Escherichia_coli
2023-01-24,Mycobacteria_tuberculosis
2023-01-24,Salmonella_enterica
2023-01-24,Staphylococcus_aureus
2023-01-25,Streptococcus_pneumoniae
2023-01-25,Pseudomonas_aeruginosa
2023-01-25,Haemophilus_influenzae 
2023-01-26,Carsonella_ruddii
2023-01-26,Escherichia_coli

What text passes through each of the pipes and the final redirect in the pipeline below?

cat bacteria.txt | head -n 5 | tail -n 3 | sort -r > final.txt

Hint: build the pipeline up one command at a time to test your understanding

The head command extracts the first 5 lines from bacteria.txt. Then, the last 3 lines are extracted from the previous 5 by using the tail command. With the sort -r command those 3 lines are sorted in reverse order and finally, the output is redirected to a file final.txt. The content of this file can be checked by executing cat final.txt. The file should contain the following lines:

2023-01-25,Streptococcus_pneumoniae
2023-01-24,Staphylococcus_aureus
2023-01-24,Salmonella_enterica
Exercise 2.4.2.4: Pipe Construction

For the file bacteria.txt from the previous exercise, consider the following command:

cut -d , -f 2 bacteria.txt

The cut command is used to remove or “cut out” certain sections of each line in the file. The optional -d flag is used to define the delimiter. A delimiter is a character that is used to separate each line of text into columns. The default delimiter is Tab, meaning that the cut command will automatically assume that values in different columns will be separated by a tab. The -f flag is used to specify the field (column) to cut out. The command above uses the -d option to split each line by comma, and the -f option to print the second field in each line, to give the following output:

Escherichia_coli
Mycobacteria_tuberculosis
Salmonella_enterica
Staphylococcus_aureus
Streptococcus_pneumoniae
Pseudomonas_aeruginosa
Haemophilus_influenzae 
Carsonella_ruddii
Escherichia_coli

The uniq command filters out adjacent matching lines in a file. How could you extend this pipeline (using uniq and another command) to find out what bacteria the file contains (without any duplicates in their names)?

cut -d , -f 2 bacteria.txt | sort | uniq
Awk: a more powerful tool for text processing

We have seen the cut command that allows the selection of columns in tabular data. If you need more powerful manipulation of tabular data you can use the command awk, which permits more powerful operations (selection, calculations etc.) on columns. For more complex operations, however, we recommend going to your favourite programming language!
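
As a brief taster (awk itself is beyond the scope of this course), the cut command from the previous exercise could be written in awk, and extended with a selection on the first column, like this:

awk -F "," '{ print $2 }' bacteria.txt
awk -F "," '$1 == "2023-01-25" { print $2 }' bacteria.txt

The first command prints the second comma-separated field of every line, just like cut -d , -f 2; the second prints it only for lines whose first field (the date) is 2023-01-25.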

Exercise 2.4.2.5: Which Pipe?

The file bacteria.txt contains 9 lines of data formatted as follows:

2023-01-24,Escherichia_coli
2023-01-24,Mycobacteria_tuberculosis
2023-01-24,Salmonella_enterica
2023-01-24,Staphylococcus_aureus
2023-01-25,Streptococcus_pneumoniae
...

The uniq command has a -c option which gives a count of the number of times a line occurs in its input. Assuming your current directory is 02_unix_intro/, what command would you use to produce a table that shows the total count of each species of bacteria in the file?

  1. sort bacteria.txt | uniq -c
  2. sort -t, -k2,2 bacteria.txt | uniq -c
  3. cut -d, -f 2 bacteria.txt | uniq -c
  4. cut -d, -f 2 bacteria.txt | sort | uniq -c
  5. cut -d, -f 2 bacteria.txt | sort | uniq -c | wc -l

Option 4. is the correct answer. If you have difficulty understanding why, try running the commands, or sub-sections of the pipelines (make sure you are in the 02_unix_intro/ directory).

Exercise 2.4.2.6: Filtering by patterns

grep is another command that searches for patterns in text. Patterns can be simple text or more complex regular expressions (these are related to, but slightly different from, the shell wildcards such as ? and * that we have seen before). Like other commands we have seen, grep can be used on multiple files. For example, if we wanted to find all occurrences of name in all the text files, we could write:

grep "name" *.txt

Using the bacteria.txt file suppose we wanted to copy all the Escherichia_coli dates to a separate file Escherichia_coli.txt. Which combination of commands would achieve this?

grep "Escherichia_coli" bacteria.txt | cut -d, -f 1 > Escherichia_coli.txt

2.4.3 More exercises on | Pipe

Exercise 2.4.3.1: Pipe Comprehension

If you had the following two text files:

cat bacteria_1.txt
Escherichia_coli
Mycobacteria_tuberculosis
Salmonella_enterica
Mycobacteria_tuberculosis
cat bacteria_2.txt
Escherichia_coli
Staphylococcus_aureus
Mycobacteria_tuberculosis
Streptococcus_pneumoniae

What would be the result of the following command?

cat bacteria*.txt | head -n 6 | tail -n 1

The result would be “Staphylococcus_aureus”. Here is a diagram illustrating what happens in each of the three steps:

                    Escherichia_coli
                    Mycobacteria_tuberculosis                 Escherichia_coli                        
cat bacteria*.txt   Salmonella_enterica         head -n 6     Mycobacteria_tuberculosis    tail -n 1
----------------->  Mycobacteria_tuberculosis  ----------->   Salmonella_enterica        ----------->  Staphylococcus_aureus
                    Escherichia_coli                          Mycobacteria_tuberculosis
                    Staphylococcus_aureus                     Escherichia_coli
                    Mycobacteria_tuberculosis                 Staphylococcus_aureus
                    Streptococcus_pneumoniae                 
  • cat bacteria*.txt would combine the content of both files, and then
  • head -n 6 would print the first six lines of the combined file, and then
  • tail -n 1 would return the last line of this output.
Exercise 2.4.3.2: zcat and grep

In the 02_unix_intro/ directory, you will find a file named MTB_H37Rv_selected_proteins.fasta.gz. This is a file that contains the amino acid sequences of some proteins from Mycobacterium tuberculosis in a text-based format known as FASTA. However, this file is compressed using an algorithm known as GZip, which is indicated by the file extension .gz.
To look inside compressed files, you can use an alternative to cat called zcat (the ‘z’ at the beginning indicates it will work on zipped files).

  1. Use zcat together with less to look inside this file.
    Hint Remember you can press Q to exit the less program.
  2. The content of this file may look a little strange, if you’re not familiar with the FASTA file format. But basically, each protein sequence name starts with the > symbol. Combine zcat with grep to extract the sequence names only. How many proteins are in the file?

Task 1

The following command allows us to “browse” through the content of this file:

zcat MTB_H37Rv_selected_proteins.fasta.gz | less

NB. Mac users may have to use this version of the command, zcat < MTB_H37Rv_selected_proteins.fasta.gz | less, to read compressed files by directing the file to zcat with <.

We can use the ↑ and ↓ arrow keys to move line-by-line, or the Page Up and Page Down keys to move page-by-page. You can exit less by pressing Q (for “quit”). This will bring you back to the console.


Task 2

We can look at the sequences’ names by running:

zcat MTB_H37Rv_selected_proteins.fasta.gz | grep ">"
>gyrB locus_tag="Rv0005"
>gyrA locus_tag="Rv0006"
>pknB locus_tag="Rv0014c"
>pknA locus_tag="Rv0015c"
>pbpA locus_tag="Rv0016c"
>whiB5 locus_tag="Rv0022c"
>rpoB locus_tag="Rv0667"
>rpoA locus_tag="Rv3457c"
>inhA locus_tag="Rv1484"
>esxT locus_tag="Rv3444c"

We can further count how many sequences there are by piping this output to wc:

zcat < MTB_H37Rv_selected_proteins.fasta.gz | grep ">" | wc -l
10

More on Cut, Sort, Unique & Count

Let’s now explore a few more useful commands to manipulate text that can be combined to quickly answer useful questions about data we may generate later on in this course.

Let’s start with the command cut, which is used to extract sections from each line of its input. For example, let’s say we wanted to retrieve only the second field (or column) of our CSV files in the 02_unix_intro directory, which contains the species classification (taxonomy_id) of reads obtained from sequencing the sample TBNmA041:

cat TBNmA041*.csv | cut -d "," -f 2
taxonomy_id
1773
78331
1764
701042
339268
1767
222805
1768
722731
1545728
...

The two options used with this command are:

  • -d defines the delimiter used to separate different parts of the line. Because this is a CSV file, we use the comma as our delimiter. The tab is the default delimiter.
  • -f defines the field or part of the line we want to extract. In our case, we want the second field (or column) of our CSV file. It’s worth knowing that you can specify more than one field, so for example if you had a CSV file with more columns and wanted columns 3 and 7 you could set -f 3,7, as in the sketch below.
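
For instance, with a hypothetical CSV file samples.csv, extracting columns 3 and 7 together would look like this:

cut -d "," -f 3,7 samples.csv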

The next command we will explore is called sort, which sorts the lines of its input alphabetically (default) or numerically (if using the -n option). Let’s combine it with our previous command to see the result:

Remember we have to add the sort option -n to achieve the required output.

cat TBNmA041*.csv | cut -d "," -f 2 | sort -n
taxonomy_id
1764
1764
1767
1768
1769
1773
1773
1781
1784
29311
78331
78331
212767
...

You can see that the output is now sorted as desired.

The cut command is often used in conjunction with another command: uniq. This command removes duplicate lines from its input. Importantly, it only compares adjacent lines, so it only works as intended if the input is sorted. That’s why it’s often used together with sort.
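
A quick way to convince yourself of this is a small experiment with printf, which here produces three lines containing a non-adjacent duplicate:

printf "A\nB\nA\n" | uniq
A
B
A
printf "A\nB\nA\n" | sort | uniq
A
B

Without sort, uniq keeps both copies of “A” because they are not next to each other.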
Let’s see it in action, by continuing building our command:

cat TBNmA041*.csv | cut -d "," -f 2 | sort -n | uniq
taxonomy_id
1764
1767
1768
1769
1773
1781
1784
29311
78331
212767
222805
339268
482462
701042
722731
1545728
1682113
1879023
1936029

We can see that now the output is de-duplicated, so only unique values are returned. And so, with a few simple commands, we’ve answered a very useful question from our data: what are the unique species in our collection of samples?

[Figure: illustration of the sort + uniq commands, by Julia Evans]

Exercise 2.4.3.3: Sort & Count I

Let’s continue working on our command:

cat TBNmA041*.csv | cut -d "," -f 2 | sort -n | uniq

As you saw, this output also returns a line called “taxonomy_id”. This was part of the header (or column name) of our CSV file, which is not really useful to have in our output.
Let’s try to solve that problem, and also ask how frequently each of these species occurs in our data.

  1. Looking at the help page for grep (grep --help or man grep), see if you can find an option to invert a match, i.e. to return the lines that do not match a pattern. Can you think of how to include this in our pipeline to remove that line from our output?
    Hint The option to invert a match with grep is -v. grep -v "taxonomy_id" would return the lines that do not match the word “taxonomy_id”.
  2. The uniq command has an option called -c. Try adding that option to the command and infer what it does (or look at uniq --help).
  3. Finally, produce a sorted table of counts for each of our taxonomic IDs in descending order (the most common species at the top).
    Hint The sort command has an option to order the output in reverse order: -r.

Task 1

Looking at the help of this function with grep --help, we can find the following option:

 -v, --invert-match        select non-matching lines

So, we can continue working on our pipeline by adding another step at the end:

cat TBNmA041*.csv | cut -d "," -f 2 | sort -n | uniq | grep -v "taxonomy_id"
1764
1767
1768
1769
1773
1781
1784
29311
...

This now removes the line that matched the word “taxonomy_id”.


Task 2

Looking at the help for uniq --help, we can see that:

 -c, --count           prefix lines by the number of occurrences

So, if we add this option, we will get the count of how many times each unique line appears in our data:

cat TBNmA041*.csv | cut -d "," -f 2 | sort -n | uniq -c | grep -v "taxonomy_id"
   2 1764
   1 1767
   1 1768
   1 1769
   2 1773
   1 1781
   1 1784
   1 29311
   2 78331
   1 212767
   1 222805
...   

Task 3

Now that we’ve counted each of our species, we can again sort this result, this time by the counts value, by adding another sort step at the end:

cat TBNmA041*.csv | cut -d "," -f 2 | sort -n | uniq -c | grep -v "taxonomy_id" | sort -r
   2 78331
   2 701042
   2 482462
   2 339268
   2 1936029
   2 1773
   2 1764
   2 1682113
   2 1545728
   1 722731
   1 29311
   1 222805
   1 212767
   1 1879023
   1 1784
   1 1781
   1 1769
   1 1768
   1 1767

We used the option -r, which from the help page sort --help, says:

  -r, --reverse               reverse the result of comparisons
Exercise 2.4.3.4: Sort & Count II

This is an (optional) advanced exercise.

In the 02_unix_intro/ directory, you will find two files named G26832.gff3.gz and G26832.tsv. We have already encountered these types of files and are hopefully familiar with their contents. The .gff3 file contains the locations and other information of genes in the Mycobacterium tuberculosis genome in a standard text-based format called GFF.
This is a tab-delimited file, where the 3rd column contains information about the kind of annotated feature, the 4th and 5th columns contain the start and end positions on the contig, and the 7th column contains the strand information. You can investigate its content with zcat G26832.gff3.gz | less -S. We use zcat because the file is compressed (we can tell from its extension ending in .gz).

Extracts from this file have been converted to a friendlier file format, .tsv. The .tsv file contains similar information, but this time with the gene names spelt out distinctly in column 7. The .tsv file can easily be opened using Excel or LibreOffice. For the tasks below, you may want to use the .tsv file to carry out the analysis.

Using a combination of the commands we have seen so far:

  1. Count how many occurrences of each feature (2nd column in .tsv) there are in the file.

  2. How many times does the gene “menH” occur in the genome?

  3. Which gene has the highest number of occurrences in the genome?

  4. Looking at the tRNAs:
    a. How many distinct tRNAs can you identify in the bacterial genome?
    b. Which of these occur more than once in the genome?
    c. Do you see any unusual tRNAs? What do you think their roles are?
Hint
  • Start by investigating the content of the file with cat G26832.tsv | head or open it with a GUI tool. You will notice the first few lines of the file contain comments starting with the # symbol. You should remove these lines before continuing.
  • Check the help for grep to remind yourself what the option is to return lines not matching a pattern.
  • Remember that the cut program uses tab as its default delimiter.
  • To see all tRNAs you can use grep "_trna".

Task 1

We could use:

cat G26832.tsv | grep -v "#" | cut -f 2 | sort | uniq -c
4041 cds
   1 crispr
  21 ncRNA
  13 ncRNA-region
   1 oriC
   3 rRNA
   1 sorf
  46 tRNA
   1 tmRNA
  • We use grep to remove the first few lines of the file that start with # character.
  • We use cut to extract the second “field” (column) of the file. Because it’s a tab-delimited file, we don’t need to specify a delimiter with cut, as that is the default.
  • We use sort to order the features in alphabetical order.
  • Finally, uniq is used to return the unique values as well as count their occurrence (with the -c option).

Task 2

The answer is 14. We could use the following command, which finds the matching lines, extracts column 7, and counts them:

cat G26832.tsv | grep "menH" | cut -f 7 | wc -l

Task 3

The gene with the highest occurrence in the genome is acrR and it occurs 38 times.

We can find this out with the command below:

cat G26832.tsv | grep -v "#" | cut -f 7 | sort | uniq -c | sort -r | head -n 20

Task 4

Task 4a For task 4a, we will simply count the number of unique tRNAs in the genome using the command below:

cat G26832.tsv | grep "_trna" | cut -f 7 | sort | uniq -c | wc -l

We observe 22 unique tRNAs.

Task 4b For task 4b, we can identify the tRNAs that occur more than once in the genome by replacing the wc -l in the previous command with another sort.

We can do this with the command below:

cat G26832.tsv | grep "_trna" | cut -f 7 | sort | uniq -c | sort -r
   5 Ser_trna
   5 Leu_trna
   4 Arg_trna
   3 Val_trna
   3 Thr_trna
   3 Pro_trna
   3 Gly_trna
   3 Ala_trna
   2 Lys_trna
   2 Glu_trna
   2 Gln_trna
...

There are 11 tRNAs that occur more than once in the genome.

Task 4c fMet_trna and Ile2_trna may be unusual to you. Google and find out what they do.

Note that, by default the sort command will order input lines alphabetically. So, for example, if it received this as input:

10
2
1
20

The result of sorting would be:

1
10
2
20

Because that’s the alphabetical order of those characters. We can use the option sort -n to make sure it sorts these as numbers, in which case the output would be as expected:

1
2
10
20

But, you may be asking yourself: why did it work with the output of uniq without specifying -n?
This is because uniq right-aligns the counts, prefixing the smaller numbers with spaces, such as this:

10
 2
 1
20

And because the space character comes first in the computer’s “alphabet”, we don’t actually need to use the -n option.

Here’s the main message: always use the -n option if you want things that look like numbers to be sorted numerically (if the input doesn’t look like a number, then sort will just order them alphabetically instead).

2.4.4 Credit

Information on this page has been adapted and modified from the following source(s):

  • Gabriel A. Devenyi (Ed.), Gerard Capes (Ed.), Colin Morris (Ed.), Will Pitchers (Ed.), Greg Wilson, Gerard Capes, Gabriel A. Devenyi, Christina Koch, Raniere Silva, Ashwin Srinath, … Vikram Chhatre. (2019, July). swcarpentry/shell-novice: Software Carpentry: the UNIX shell, June 2019 (Version v2019.06.1).

  • https://github.com/cambiotraining/UnixIntro

  • https://github.com/cambiotraining/unix-shell