Transferring Files md5
Quite often we will need to transfer files to or from another computer or server. We can use what’s called an “md5 hash” to confirm a successful file transfer. A hash is a fixed-length string of 32 hexadecimal characters (e.g., beec7332e06b09a262ab252544cf7b60) that is generated by running a hashing algorithm on the file's contents. Because different files will almost certainly give different md5 hashes, you can confirm a successful transfer by checking that the md5 hash of the transferred file matches that of the original file.
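To make the idea concrete, here is a minimal Python sketch of how an md5 hash is computed from a file's contents. It uses only the standard-library hashlib module; the helper name md5_of_file and the 1 MB chunk size are illustrative choices, not part of any existing pipeline.

```python
import hashlib

def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the md5 hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # read() returns b"" at end of file, which stops the iteration
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Inputs that differ by a single character give unrelated hashes.
print(hashlib.md5(b"hello").hexdigest())  # 5d41402abc4b2a76b9719d911017c592
print(hashlib.md5(b"hellp").hexdigest())
```

The hex digest returned by md5_of_file should match the hash that md5sum prints for the same file.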
We recommend using GLOBUS (add link) to transfer files. However, if you use another file transfer mechanism, use md5sums to make sure your files transferred successfully.
Below is a general overview, followed by an example of a parallel implementation for checking many files.
General Overview (serial implementation)
Step 1: Make a list of md5sums of your files
If the sequencing facility provided md5 sums in a file (e.g., md5_sum.txt), you can use that file.
If not, create one (the command is md5sum on Linux and md5 on macOS):
md5sum (list files) > md5sum_original.txt
Step 2: Transfer files to new location (including the md5sum_original.txt)
After transferring files, change directory to that location (which could be on another computer/server).
Then, create an md5sum of the transferred files:
md5sum (list files in new location) > md5sum_new.txt
Step 3: Compare md5sum from before the transfer to md5sum after the transfer
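Compare the two files line by line with comm, which expects both files to be sorted. If they are not, sort each one first (e.g., sort -o md5sum_new.txt md5sum_new.txt).
comm -3 md5sum_original.txt md5sum_new.txt
With no options, comm produces three-column output. Column one contains lines unique to FILE1, column two contains lines unique to FILE2, and column three contains lines common to both files. The options suppress columns:
-1 suppress lines unique to FILE1
-2 suppress lines unique to FILE2
-3 suppress lines that appear in both files
A line counts as common only if both the hash and the file name are identical, so list the files the same way (e.g., by relative path) in both locations.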
If the last command produces no output, the two files are identical.
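The comparison in Step 3 can also be sketched in Python, which avoids the need for sorted input. This is a minimal sketch, assuming both files use the md5sum layout (hash, whitespace, file name) and that file names are unique once the directory part is removed; the function names read_md5_list and compare_md5_lists are hypothetical.

```python
import os

def read_md5_list(path):
    """Parse md5sum output ('<hash>  <file>' per line) into {file name: hash}."""
    hashes = {}
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue
            digest, name = line.split(maxsplit=1)
            # md5sum marks binary mode with a leading '*'; drop it and any directory part
            hashes[os.path.basename(name.lstrip("*"))] = digest
    return hashes

def compare_md5_lists(original_path, new_path):
    """Return file names whose hashes differ or that appear in only one list."""
    original = read_md5_list(original_path)
    new = read_md5_list(new_path)
    return sorted(name for name in original.keys() | new.keys()
                  if original.get(name) != new.get(name))
```

For example, compare_md5_lists('md5sum_original.txt', 'md5sum_new.txt') returning an empty list plays the same role as comm -3 producing no output. Because files are matched by base name, differing directory prefixes between the two locations do not cause false mismatches.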
Parallel implementation
Below are some examples of how to deploy md5 checksums in parallel. These instructions assume you are not on a login node of Discovery (i.e., you’re able to use more than one CPU - you may need to request an interactive job).
Step 1: Make a list of md5sums of your files
Instead of running each command serially (as in Step 1 above), we can put commands for multiple files into yet another file we’ll call get_hashes.sh. Here are the first three lines of get_hashes.sh:
get_hashes.sh
Note that the commands use >>, which appends the output to the file md5sum_original.txt, as opposed to >, which would overwrite the file with every command/line.
Then you can use GNU parallel and choose the number of CPUs you would like to use. For instance, to use 36 CPUs:
cat get_hashes.sh | parallel -j 36 --progress --eta
Note: creating the file get_hashes.sh is probably quicker if you use R (or another language) to create it by looping through your files, instead of writing each command/line by hand (including copying/pasting/editing). Here is an example in Python:
create_get_hashes_sh_file.py
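import shlex
from pathlib import Path

def get_files():
    # placeholder: replace with your own logic; here, every file in a hypothetical directory
    return sorted(str(p) for p in Path('/path/to/files').iterdir() if p.is_file())

# list of the files for which you want hashes
files = get_files()

# make a list of commands (shlex.quote protects file names containing spaces)
commands = []
for file in files:
    commands.append(f'md5sum {shlex.quote(file)} >> /path/to/md5sum_original.txt')

# identify where you want commands to go
hash_file = '/path/to/get_hashes.sh'

# write commands to hash_file, one per line
with open(hash_file, 'w') as o:
    o.write('\n'.join(commands) + '\n')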
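As an alternative to generating get_hashes.sh and running it through GNU parallel, the hashing can be done concurrently from Python itself. The sketch below is not the method described above; it swaps in the standard-library ThreadPoolExecutor, relying on hashlib releasing the interpreter lock while it hashes large chunks. The names md5_of_file and write_md5_list and the default of 4 workers are assumptions for illustration.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def write_md5_list(paths, out_path, workers=4):
    """Hash files concurrently and write '<hash>  <path>' lines in the order of `paths`."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in input order, regardless of which job finishes first
        digests = list(pool.map(md5_of_file, paths))
    with open(out_path, "w") as out:
        for digest, path in zip(digests, paths):
            out.write(f"{digest}  {path}\n")
```

Because the results are written in input order by a single writer, the output file has a deterministic line order, unlike appending with >> from many parallel jobs.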
Step 2: Transfer files to new location
After transferring the files, you can then do the same thing as in Step 1, but for the transferred files, this time putting the hashes into md5sum_new.txt.
Step 3: Compare md5sum from before the transfer to md5sum after the transfer
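Step 3 here is the same as Step 3 above. Because parallel jobs finish in an arbitrary order, the lines of md5sum_original.txt and md5sum_new.txt will generally be in different orders, so sort both files before comparing them.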