2.5 Using remote resources

Teaching: 15 min || Exercises: 5 min

Overview

Questions:
  • How can I work on the unix shell of a remote computer?
  • How can I move files between computers?
  • How can I access web resources using the command line?
  • How do I mount a directory from another computer onto my local filesystem?
  • When might different remote access tools be more appropriate for a task?
Learning Objectives:
  • Know how to securely connect and work from a remote computer.
  • Know how to copy files to/from a remote computer.
Key Points:
  • The ssh program can be used to securely login to a remote server. The general command is ssh username@remote, where username is the user’s name on the remote machine and remote is the name of that machine (sometimes in the form of an IP address)
  • scp a method of copying files using the ssh protocol. This is used to copy files to/from a remote server from your local machine. The syntax for this command is ssh username@remote:path_to_remote_file path_on_local_machine to copy a file from the remote machine to the local machine or ssh path_on_local_machine username@remote:path_to_remote_file vice-versa.
  • sshfs a method of using the ssh protocol to connect your local filesystem directly to a remote filesystem as if they were connected
  • rsync a different method of file transfer which only copies files that have changed
  • wget a command which can download files from http links as if it were in a browser
Warning

If you are a self-learner (rather than attending one of our workshops), you will need an account on a remote server to follow this section. You will need to adjust the commands shown on this page to match your username and hostname on that server.

2.5.1 Background

Let’s take a closer look at what happens when we use the shell on a desktop or laptop computer. The first step is to log in so that the operating system knows who we are and what we’re allowed to do. We do this by typing our username and password; the operating system checks those values against its records, and if they match, runs a shell for us.

As we type commands, the 1’s and 0’s that represent the characters we’re typing are sent from the keyboard to the shell. The shell displays those characters on the screen to represent what we type, and then, if what we typed was a command, the shell executes it and displays its output (if any).

What if we want to run some commands on another machine, such as the Noguchi server in the Prince Alwaleed building that manages our database of field results? To do this, we have to first log in to that machine. We call this a remote login.

In order for us to be able to login, the remote computer must be running a remote login server and we will run a client program that can talk to that server. The client program passes our login credentials to the remote login server and, if we are allowed to login, that server then runs a shell for us on the remote computer.

Once our local client is connected to the remote server, everything we type into the client is passed on, by the server, to the shell running on the remote computer. That remote shell runs those commands on our behalf, just as a local shell would, then sends back output, via the server, to our client, for our computer to display.

This phenomenon works just like accessing your email. In fact, if you have used an email address to send mails then you have unknowingly used a remote server. Have you wondered where all your email messages are sitting? Your guess is as good as mine!

2.5.2 The ssh protocol

The Secure SHell (SSH) is a protocol which allows us to send secure encrypted information across an unsecured network, like the internet. The underlying protocol supports a number of commands we can use to move information of different types in different ways. The simplest and most straightforward is the ssh command which facilitates a remote login session connecting our local user and shell to any remote user we have permission to access.

ssh sshuser@remote-machine

The first argument specifies the location of the remote machine (by IP address or a URL) as well as the user we want to connect to seperated by an @ sign. For the purpose of this course we’ve set up a container on our cluster for you to connect to the address remote-machine is actually a special kind of URL that only your computer will understand. In real life this would normally be an address on the internet which can be access from anywhere or at least an institutional local area network.

The authenticity of host '[192.168.1.59]:2231 ([192.168.1.59]:2231)' 
can't be established.
RSA key fingerprint is SHA256:4X1kUMDOG021U52XDL2U56GFIyC+S5koImofnTHvALk.
Are you sure you want to continue connecting (yes/no)?

When you connect to a computer for the first time you should see a warning like the one above. This signifies that the computer is trying to prove it’s identity by sending a fingerprint which relates to a key that only it knows. Depending on the security of the server you are connecting to they might distribute the fingerprint ahead of time for you to compare and advise you to double check it in case it changes at a later log on. In our case it is safe to type yes .

sshuser@192.168.1.59's password: ********

Now you are prompted for a password. Enter the appropriate password and hit enter.

    sshuser@remote_machine:~$

You should now have a prompt very similar to the one you started with but with a new username and computer hostname. Take a look around with the ls command and you should see that your new session has its own completely independent filesystem. Unfortunately None of the files are particularly relevant to us. Let’s change that, but first we need to go back to our original computer’s shell. Use the key combination Ctrl + D (Ctrl+D) on an empty command prompt to log out.

2.5.3 Moving files

ssh has a simple file copying counterpart called secure copy scp which uses all the same methods for authentication and encryption but focuses on moving files between computers in a similar manner to the cp command we learnt about before.

Making sure we’re in the 02_unix_intro directory let’s copy the bacteria.txt file to the remote machine

cd ~/Desktop/workshop_files_Bact_Genomics_2023/02_unix_intro
scp bacteria.txt sshuser@remote-machine:/home/user
Note

You will be prompted to provide your login credentials while attempting to copy to/from a remote source.

The format of the command should be quite familiar when comparing to the cp command for local copying. The last two arguments specify the source and the destination of the copy respectively. The difference comes in that any remote locations involved in the copy must be preceded by the username@IP syntax used in the ssh command previously. The first half tells scp how to access the computer and the second half tells it where in the filesystem to operate, these two segments are separated by a : .

A successful copy should produce something like this

bacteria.txt                                     100%   86   152.8KB/s   00:00

Now it looks like we’ve copied the file, but we should check.

Establishing a whole ssh session just to run one command might be a bit cumbersome. Instead we can tell ssh all the commands it needs to run at the same time we connect by adding an extra argument to the end. ssh will automatically disconnect after it completes the full command string.

ssh sshuser@remote-machine "ls /home/user/*.txt"
bacteria.txt

Success!

Exercise 2.5.3.1: How do we get files back from the remote server?

Using two commands we’ve uploaded a file to our server and checked it was there

scp bacteria.txt sshuser@remote-machine:/home/user
ssh sshuser@remote-machine "ls /home/user/*.txt"

More often we might want to have the server do some work on our files and then get them back. We can use exactly the same syntax structure with different arguments.

Try and make a change to bacteria.txt (perhaps add some text to the end with the echo command and >>) in the remote location and then retrieve the changed file. Remember to change the name of the file as you already have a bacteria.txt in your directory.

If you’re having trouble check man scp for more information about how to structure the command

ssh sshuser@remote-machine "echo all done >> bacteria.txt"
scp sshuser@remote-machine:bacteria.txt changed_bacteria.txt

2.5.4 Managing multiple remote files

Sometimes we need to manage a large number of files across two locations, often maintaining a specific directory structure. scp can handle this with it’s -r flag. Just like cp this puts the command in recursive mode allowing it to copy entire directories of files

scp -r sshuser@remote-machine:/home/user/workshop_files_Bact_Genomics_2023/02_unix_intro .
TB_H37Rv_truncated.fasta                        100%  150KB   1.9MB/s   00:00    
sample4_run2.fastq                              100%  372     7.0KB/s   00:00    
sample3_run2.fastq                              100%  372     7.6KB/s   00:00    
sample1_run2.fastq                              100%  370     8.4KB/s   00:00    
sample5_run2.fastq                              100%  372     7.5KB/s   00:00    
sample2_run2.fastq                              100%  370    12.7KB/s   00:00    
G26832.tsv                                      100% 1008KB   7.1MB/s   00:00    
G26832.tsv                                      100% 1008KB   5.7MB/s   00:00    
G26832.embl                                     100%   12MB  10.2MB/s   00:01    
G26832.gff3                                     100% 5698KB   9.6MB/s   00:00    
contigs.gfa                                     100% 4314KB   9.3MB/s   00:00    
G26832_R1.fastq.gz                              100%  218MB  10.6MB/s   00:20    
G26832.fna                                      100% 4329KB   9.7MB/s   00:00    
G26832.mlst.tsv                                 100%   89     1.3KB/s   00:00    
G26832.txt                                      100%  392     8.1KB/s   00:00    
G26832.ffn                                      100% 4005KB   9.4MB/s   00:00    
...    
Note

Did you realise the format of the above command was different from the previous scp command? Here, we copied files from the remote server unto our local machine. Check in your current directory if you will find the files you just copied.

However scp isn’t always the best tool to use for managing this kind of operation.

When you run scp it copies the entirety of every single file you specify. For an initial copy this is probably what you want, but if you change only a few files and want to synchronize the server copy to keep up with your changes it wouldn’t make sense to copy the entire directory structure again.

For this scenario rsync can be an excellent tool.

2.5.5 rsync

First lets add some new files to our 02_unix_intro directory using the touch command. This command does nothing but create an empty file or update the timestamp of an existing file.

touch newfile1 newfile2 newfile3

Now we have everything set up we can issue the rsync command to sync our two directories.

rsync -avz 02_unix_intro sshuser@remote-machine:/home/user/workshop_files_Bact_Genomics_2023/02_unix_intro

The -v flag/option stands for verbose and outputs a lot of status information during the transfer. This is helpful when running the command interactively but should generally be removed when writing scripts.

The -z flag/option means that the data will be compressed in transit. This is good practice depending on the speed of your connection (although in our case it is a little unnecessary).

Whilst we’ve used rsync in mostly the same way as we did scp it has many more customisation options as we can observe on the man page

man rsync

-a is useful for preserving filesystem metadata regarding the files being copied. This is particularly relevant for backups or situations where you intend to sync a copy back over your original at a later point.

--exclude and --include can be used to more granularly control which files are copied.

Finally --delete is very useful when you want to maintain an exact copy of the source including the deletion of files only present in the destination. Let’s tidy up our 02_unix_intro directory with this option.

Exercise 2.5.5.1: How to remove files using rsync --delete

First we should manually delete our local copies of the empty files we created.

rm newfile*

consulting the output of man rsync synchronize the local folder with the remote folder again. This time causing the remote copies of newfile* to be deleted.

keep in mind that for rsync a path with a trailing / means the contents of a directory rather than the directory itself. Think about how using the --delete flag could make it very dangerous to make this mistake.

Don’t worry too much though, you can always upload a new copy of the data.

rsync -avz --delete 02_unix_intro sshuser@remote-machine:/home/user/workshop_files_Bact_Genomics_2023/02_unix_intro

2.5.6 SshFs

Sshfs is another way of using the same ssh protocol to share files in a slightly different way. This software allows us to connect the file system of one machine with the file system of another using a “mount point”. Let’s start by making a directory in /home/ubuntu/Desktop/workshop_files_Bact_Genomics_2023/02_unix_intro to act as this mount point. Convention tells us to call it mnt.

mkdir /home/ubuntu/Desktop/workshop_files_Bact_Genomics_2023/02_unix_intro/mnt

Now we can run the sshfs command

sshfs sshuser@remote-machine:/home/user /home/ubuntu/Desktop/workshop_files_Bact_Genomics_2023/02_unix_intro/mnt/

It looks fairly similar to the previous copying commands. The first argument is a remote source, the second argument is a local destination. The difference is that now whenever we interact with our local mount point it will be as if we were interacting with the remote filesystem starting at the directory we specified /home/sshuser.

cd /home/ubuntu/Desktop/workshop_files_Bact_Genomics_2023/02_unix_intro/mnt/
ls -l
total 5824
-rw-r--r--@  1 user  staff  1626227 Nov 23 12:13 G26832.gff3.gz
-rw-r--r--@  1 user  staff  1031772 Nov 23 12:27 G26832.tsv
-rw-r--r--@  1 user  staff     3249 Nov 22 17:23 MTB_H37Rv_selected_proteins.fasta.gz
-rw-r--r--@  1 user  staff   153474 Nov 22 12:10 MTB_H37Rv_truncated.fasta
-rw-r--r--@  1 user  staff      784 Nov 22 17:59 TBNmA041_Species_firstset.csv
-rw-r--r--@  1 user  staff      869 Nov 22 17:59 TBNmA041_Species_secondset.csv
-rw-r--r--@  1 user  staff    37336 Nov 22 13:35 TBNmA041_annotation_truncated_1.gff3
-rw-r--r--@  1 user  staff    37336 Nov 22 13:35 TBNmA041_annotation_truncated_1.tsv
-rw-r--r--@  1 user  staff    53321 Nov 22 13:37 TBNmA041_annotation_truncated_2.gff3
-rw-r--r--@  1 user  staff      290 Nov 22 16:20 bacteria.txt
drwxr-xr-x@  9 user  staff      288 Nov 22 15:25 bacteria_rpob
drwxr-xr-x@ 10 user  staff      320 Nov 22 15:58 molecules
-rw-r--r--@  1 user  staff      554 Oct 12 14:59 morse.txt
drwxr-xr-x@  4 user  staff      128 Nov 21 17:44 mycobacteria_rpob
drwxr-xr-x@  4 user  staff      128 Nov 21 16:09 north-pacific-gyre
-rw-r--r--@  1 user  staff      141 Nov 21 18:03 nucleotides.txt
drwxr-xr-x@  7 user  staff      224 Nov 22 13:03 sequencing_run1
drwxr-xr-x@  7 user  staff      224 Nov 22 13:04 sequencing_run2

The files shown are not on the local machine at all but we can still access and edit them. Much like the concept of a shared network drive in Windows.

This approach is particularly useful when you need to use a program which isn’t available on the remote server to edit or visualize files stored remotely. Keep in mind however, that every action taken on these files is being encrypted and sent over the network. Using this approach for computationally intense tasks could substantially slow down the time to completion.

It’s also worth noting that this isn’t the only way to mount remote directories in linux. Protocols such as nfs and samba are actually more common and may be more appropriate for a given use case. Sshfs has the advantage of running over ssh so it require very little set up on the remote computer.

2.5.7 wget - Accessing web resources

wget is in a different category of command compared to the others in this section. It is mainly for accessing resources that can be downloaded over http(s) and doesn’t have a mechanism for uploading/pushing files.

Whilst this tool can be customised to do a wide range of tasks, at its simplest it can be used to download datasets for processing at the start of a processing pipeline.

The data for this course can be downloaded as follows

wget https://www.dropbox.com/sh/wgya3rngcl9nngm/AACYjpWnNSmqez_hw1kf0nDPa?dl=0 -O course_data.zip
--2022-11-23 14:20:41--  https://www.dropbox.com/sh/ggrruxq2w1zb3gn/AAD7lzWEuQlLISC_ZWqRybJia?dl=0
Resolving www.dropbox.com (www.dropbox.com)... 162.125.64.18
Connecting to www.dropbox.com (www.dropbox.com)|162.125.64.18|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: /sh/raw/ggrruxq2w1zb3gn/AAD7lzWEuQlLISC_ZWqRybJia [following]
--2022-11-23 14:20:42--  https://www.dropbox.com/sh/raw/ggrruxq2w1zb3gn/AAD7lzWEuQlLISC_ZWqRybJia
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc2998e7e38ee57258e41fcd6c69.dl.dropboxusercontent.com/zip_download_get/BUvw0Y7LsejbFNGF_CUwWx9ZoNLvYxLa58t5BMxPei6Asg_wxPm2W3Eb7v9EH4deBFDd6nh8VlQwFLrJoeMq1d0bZIHAa_80V9DTaoKZXsgukQ# [following]
--2022-11-23 14:20:47--  https://uc2998e7e38ee57258e41fcd6c69.dl.dropboxusercontent.com/zip_download_get/BUvw0Y7LsejbFNGF_CUwWx9ZoNLvYxLa58t5BMxPei6Asg_wxPm2W3Eb7v9EH4deBFDd6nh8VlQwFLrJoeMq1d0bZIHAa_80V9DTaoKZXsgukQ
Resolving uc2998e7e38ee57258e41fcd6c69.dl.dropboxusercontent.com (uc2998e7e38ee57258e41fcd6c69.dl.dropboxusercontent.com)... 162.125.64.15
Connecting to uc2998e7e38ee57258e41fcd6c69.dl.dropboxusercontent.com (uc2998e7e38ee57258e41fcd6c69.dl.dropboxusercontent.com)|162.125.64.15|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 46293314 (44M) [application/zip]
Saving to: ‘course_data.zip’

course_data.zip                    100%[===============================================================>]  44.15M  9.96MB/s    in 5.5s    

2022-11-23 14:20:54 (8.03 MB/s) - ‘course_data.zip’ saved [46293314/46293314]

For large files or sets of files there are also a few useful flags/options

  • -b Places the task in the background and writes all output into a log file

  • -c can be used to resume a download that has failed mid-way

  • -i can take a file with a list of URLs for downloading

Where tools like wget shine in particular is in using URLs generated by web-based REST APIs like the one offered by Ensembl .

We can use bash to combine strings and form a valid query URL that meets our requirements.

wget https://rest.ensembl.org/genetree/member/symbol/homo_sapiens/BRCA2?prune_taxon=9526;content-type=text/x-orthoxml%2Bxml;prune_species=cow
Exercise 2.5.7.1: When should we use which tool?

Ultimately there is no hard right answer to this and any tool that works, works. That said, can you think of a scenario where you think each of the following might be the best choice.

scp rsync wget

scp - Any time you’re copying a file or set of files once and in one direction scp makes sense

rsync - This is a good choice when you need to have a local copy of the whole dataset but there is also relatively frequent communication to update the source/destination. It also makes sense when the dataset gets too large to transfer regularly as a whole. If your rsync use is getting complex consider separating the more dynamic files for sophisticated version control with something like git

wget - Whenever there is a single, canonical, mostly unchanging, web-based source for a piece of data wget is a good choice. Downloading all the data required for a script at the top with a series of wget commands can be good practice to organise your process. If you find wget limiting for this purpose a similar command called curl can be slightly more customisable for use in programming

Note

Although using these resources is very good especially being able to include them in scripts, currently, there are GUI applications that can enable you communicate just as good with remote servers. We will look at the vscode in our next lesson.

2.5.8 Credit

Information on this page has been adapted and modified from the following source(s):

  • https://github.com/cambiotraining/UnixIntro

  • https://github.com/cambiotraining/unix-shell