Differences
This shows you the differences between two versions of the page.
parallel_rsync [08.08.2016 19:57] – Pascal Suter | parallel_rsync [20.05.2020 19:39] – [the bash function] Pascal Suter | ||
---|---|---|---|
Line 1: | Line 1: | ||
- | ====== Parallel Rsync (how I believe | + | ====== Parallel Rsync (my way) ====== |
+ | rsync is sooo cool, chances are, if you need to copy some files for whatever reason from one linux machine to another or even from one directory to another, rsync has everything you need. one thing though is terribly missing: parallelism | ||
+ | |||
+ | when you copy files with rsync you often see an io performance (by using iotop for example) that is far below what your disks or your network connection are capable of. | ||
+ | - when you copy a directory containing many small files locally, rsync is slowed down by all the metadata operations it does (copying all the permissions, | ||
+ | - when you copy files across a network, you are slowed down by a single threaded ssh process which can only use one cpu core for encrypting and decrypting data on that connection. | ||
+ | |||
+ | my solution to that: run multiple rsync processes in parallel and leverage the power of several cpu cores in parallel. | ||
+ | |||
+ | here is a bash script function I wrote which i can use in scripts to copy files from A to B through multiple connections. | ||
+ | |||
+ | **Use this at your own risk** If you are interested in understanding what it does (and i strongly suggest you get interested in that before using this blindly!) you can read through my ideas on how to ideally parallelize rsync below the script. | ||
+ | |||
+ | ===== the bash function | ||
+ | what this function does is as follows: it runs a find across the source directory and gets a list of all files and directories within it. it then extracts a list of directories with a maximum directory depth of $3 (the 3rd argument to the function). it then queues these directories to sync them and runs $4 (the fourth argument) rsync processes in parallel to do so using xargs. | ||
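The listing and queueing described above can be sketched roughly as follows. This is only an illustration under assumptions: `list_dirs` and `parallel_demo` are hypothetical names that do not appear in prsync.sh, and the worker just prints what it would sync instead of calling rsync.

```shell
#!/bin/sh
# Sketch: list directories up to a given depth and process them in parallel.
# list_dirs and parallel_demo are hypothetical helper names.

# print all directories below $1, at most $2 levels deep, NUL-delimited
list_dirs() {
    local src=$1 depth=$2
    ( cd "$src" && find . -mindepth 1 -maxdepth "$depth" -type d -print0 )
}

# run up to $3 workers at once; a real worker would call rsync here
parallel_demo() {
    local src=$1 depth=$2 jobs=$3
    list_dirs "$src" "$depth" |
        xargs -0 -r -P "$jobs" -n 1 sh -c 'echo "would sync: $1"' _ |
        sort
}

# small demo tree: "deep" sits at level 3 and is not listed with depth 2
demo=$(mktemp -d)
mkdir -p "$demo/a/x" "$demo/a/y" "$demo/b/z/deep"
parallel_demo "$demo" 2 4
rm -rf "$demo"
```

The NUL delimiter (`-print0` with `xargs -0`) keeps directory names with spaces or quotes intact, which is the same reason version 2 of the function gives for using it.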
+ | |||
+ | once all directories have been synced it runs a single rsync process across the whole tree, just as you would for a plain single-threaded copy. since all files have already been copied at this point, this step is mostly a safety measure to make sure we really copied everything and that all file attributes are correct. | ||
+ | |||
+ | once this step is passed it runs another sanity check and compares md5 sums between all source and target files. this might take very long and is not really necessary but since i programmed this function for an archive script that will copy files to an archive before they are deleted from the source i wanted to be 100% sure everything went okay :) | ||
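A minimal, local-only sketch of such a checksum comparison is shown here. The names `checksum_list` and `trees_match` are hypothetical and not part of prsync.sh, and the sketch assumes `md5sum` from GNU coreutils is available.

```shell
#!/bin/sh
# Sketch: compare two local directory trees by per-file md5 sums.
# checksum_list and trees_match are hypothetical helper names.

# print "checksum  relative-path" for every file below $1, sorted by path
checksum_list() {
    ( cd "$1" && find . -type f -print0 | xargs -0 -r md5sum | sort -k 2 )
}

# succeed only if both trees contain the same files with the same content
trees_match() {
    [ "$(checksum_list "$1")" = "$(checksum_list "$2")" ]
}
```

Running `trees_match /tmp/source /tmp/target && echo identical` would then report whether the two trees hold the same data. Paths are listed relative to each tree, so the two lists can be compared line by line.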
+ | |||
+ | how many jobs should run in parallel and how many directories deep you want to parallelize your jobs really depends on your specific situation. if you have several terabytes of data and you do a complete sync, it makes sense to dive deeper into the structure than when you just want to update an already existing copy of the same data. in that case it might be faster to only dive 1 to 2 levels deep into your structure or even not use this script at all, when most of the time is spent by " | ||
+ | |||
+ | in the second version | ||
+ | a second filename can optionally be passed as $6. prsync will save its progress to that file. if prsync is re-run, this file is checked before the start of each rsync process. if the directory that is about to be rsynced is already on the list, it is skipped. this avoids re-running rsync for a large number of already synced directories and speeds up resuming after an interrupted prsync run. | ||
+ | |||
+ | these two optional arguments should only be used if the source does not change between prsync runs. they are especially beneficial if the source storage is unstable and may crash after a certain period of time. using these two files helps to prevent unnecessary file scanning and comparing when resuming the prsync operation after a crash, and hence helps to advance the progress faster by minimizing unnecessary load on the storage. | ||
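The skip logic around the progress file could look roughly like the sketch below. It is an assumption about the general idea, not a copy of the function's code: `sync_one` and `do_sync` are hypothetical names, and `do_sync` only prints a message where the real function would run rsync.

```shell
#!/bin/sh
# Sketch: skip directories that are already recorded in a progress log.
# sync_one and do_sync are hypothetical names; do_sync stands in for rsync.

do_sync() { echo "syncing $1"; }

sync_one() {
    local dir=$1 progressfile=$2
    # -x matches whole lines only, -F treats the name as a fixed string
    if [ -n "$progressfile" ] && [ -f "$progressfile" ] &&
        grep -qxF -- "$dir" "$progressfile"; then
        echo "skipping $dir (already synced)"
        return 0
    fi
    do_sync "$dir" || return $?
    # record the directory only after a successful sync
    if [ -n "$progressfile" ]; then
        printf '%s\n' "$dir" >> "$progressfile"
    fi
}
```

Writing the entry only after the sync succeeded means an interrupted directory is retried on the next run. Matching whole lines keeps a logged `./a` from hiding a pending `./ab`.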
+ | ==== the code ==== | ||
+ | |||
+ | <code bash prsync.sh> | ||
+ | # | ||
+ | # Parallel Rsync function 2020 by Pascal Suter @ DALCO AG, Switzerland | ||
+ | # documentation and explanation at http:// | ||
+ | # | ||
+ | # version 1: initial release in 2017 | ||
+ | # version 2: removed the need to escape filenames by using null delimiter + xargs to run commands such as mkdir and rsync, | ||
+ | # added ability to resume without rescanning (argument $5) and to skip already synced directories (argument $6) | ||
+ | # | ||
+ | |||
+ | psync() { | ||
+ | # $1 = source | ||
+ | # $2 = destination | ||
+ | # $3 = dirdepth | ||
+ | # $4 = numjobs | ||
+ | # $5 = dirlist file (optional) --> will allow to resume without re-scanning the entire directory structure | ||
+ | # $6 = progress log file (optional) --> will allow to skip previously synced directory when resuming with a dirlist file | ||
+ | source=$1 | ||
+ | destination=$2 | ||
+ | depth=$3 | ||
+ | threads=$4 | ||
+ | dirlistfile=$5 | ||
+ | progressfile=$6 | ||
+ | |||
+ | # gets directory listing from remote or local using ssh and find | ||
+ | dirlist(){ | ||
+ | #$1 = path, $2 = maxdepth | ||
+ | path=$1 | ||
+ | echo " | ||
+ | if [ $? -eq 0 ]; then | ||
+ | remote=`echo " | ||
+ | remotepath=${path: | ||
+ | ssh $remote "find $remotepath/ | ||
+ | else | ||
+ | find $1/./ -maxdepth $2 -type d | perl -pe ' | ||
+ | fi | ||
+ | } | ||
+ | |||
+ | # get a sorted list of md5sums of all files in a directory (remote via ssh or local) | ||
+ | md5list(){ | ||
+ | #$1 = path | ||
+ | path=$1 | ||
+ | echo " | ||
+ | if [ $? -eq 0 ]; then | ||
+ | remote=`echo " | ||
+ | remotepath=${path: | ||
+ | ssh $remote "cd $remotepath; | ||
+ | else | ||
+ | cd "$path" && find . -type f -print0 | xargs -0 -P "$threads" -n 1 md5sum | sort -k 2 | ||
+ | fi | ||
+ | } | ||
+ | |||
+ | # generate a list of directories to sync | ||
+ | if [ -z " | ||
+ | rawfilelist=$(dirlist $source $depth) | ||
+ | else | ||
+ | # dirlist filename was passed check if it exists and load dirlist from there, otherwise create it and save the dirlist to the file | ||
+ | if [ -f $dirlistfile ]; then | ||
+ | rawfilelist=$(< | ||
+ | else | ||
+ | rawfilelist=$(dirlist $source $depth | tee $dirlistfile) | ||
+ | fi | ||
+ | fi | ||
+ | |||
+ | # separate paths less than DIRDEPTH deep from the others, so that only the " | ||
+ | i=$(($depth - 1)) | ||
+ | parentlist=`echo " | ||
+ | filelist=`echo " | ||
+ | |||
+ | # create target directory: | ||
+ | path=$destination | ||
+ | echo " | ||
+ | if [ $? -eq 0 ]; then | ||
+ | remote=`echo " | ||
+ | remotepath=${path: | ||
+ | echo -n -e " | ||
+ | else | ||
+ | echo -n -e " | ||
+ | fi | ||
+ | |||
+ | #sync parents first | ||
+ | echo " | ||
+ | echo "Sync parents" | ||
+ | echo " | ||
+ | function PRS_syncParents(){ | ||
+ | source=$2 | ||
+ | destination=$3 | ||
+ | progressfile=$4 | ||
+ | if [ -n " | ||
+ | echo " | ||
+ | else | ||
+ | echo -n -e " | ||
+ | status=$? | ||
+ | if [ -n " | ||
+ | echo " | ||
+ | fi | ||
+ | return $status | ||
+ | fi | ||
+ | } | ||
+ | export -f PRS_syncParents | ||
+ | echo " | ||
+ | status=$? | ||
+ | if [ $status -gt 0 ]; then | ||
+ | cat / | ||
+ | rm / | ||
+ | echo "ERROR ($status): there was an error when syncing the parent directories, | ||
+ | return 1 | ||
+ | fi | ||
+ | #sync leafs recursively | ||
+ | echo " | ||
+ | echo "Sync leafs recursively" | ||
+ | echo " | ||
+ | function PRS_syncLeafs(){ | ||
+ | source=$2 | ||
+ | destination=$3 | ||
+ | progressfile=$4 | ||
+ | if [ -n " | ||
+ | echo " | ||
+ | else | ||
+ | echo -n -e " | ||
+ | status=$? | ||
+ | if [ -n " | ||
+ | echo " | ||
+ | fi | ||
+ | return $status | ||
+ | fi | ||
+ | } | ||
+ | export -f PRS_syncLeafs | ||
+ | echo " | ||
+ | status=$? | ||
+ | if [ $status -gt 0 ]; then | ||
+ | cat / | ||
+ | rm / | ||
+ | echo " | ||
+ | return 1 | ||
+ | fi | ||
+ | #exit # uncomment for debugging what happens before the final rsync | ||
+ | |||
+ | #run a single thread rsync across the entire project directory | ||
+ | #to make sure nothing is left behind. | ||
+ | echo " | ||
+ | echo "final sync to double check" | ||
+ | echo " | ||
+ | rsync -aHvx --delete --numeric-ids "$source/" "$destination/" | ||
+ | if [ $? -gt 0 ]; then | ||
+ | echo " | ||
+ | return 1 | ||
+ | fi | ||
+ | |||
+ | return 0 # comment out if you want to really do the md5 sums, this may take very long! | ||
+ | |||
+ | #create an md5 sum of the md5sums of all files of the entire project directory to compare it to the archive copy | ||
+ | echo " | ||
+ | echo " | ||
+ | echo " | ||
+ | diff <( md5list $source ) <( md5list $destination ) | ||
+ | if [ $? -gt 0 ]; then | ||
+ | echo " | ||
+ | return 1 | ||
+ | fi | ||
+ | |||
+ | echo " | ||
+ | } | ||
+ | </ | ||
+ | |||
+ | **Usage** | ||
+ | you can run this function like so: | ||
+ | source prsync.sh | ||
+ | psync sourceHost:/ | ||
+ | this will copy the / | ||
**caution** this is a work in progress.. I am writing down my notes as I go! | **caution** this is a work in progress.. I am writing down my notes as I go! | ||
**caution** please be careful with the instructions below and think it through yourself. I will take no responsibility for any data loss as a result of this article. | **caution** please be careful with the instructions below and think it through yourself. I will take no responsibility for any data loss as a result of this article. | ||
- | |||
- | rsync is sooo cool, chances are, if you need to copy some files for whatever reason from one linux machine to another or even from one directory to another, rsync has everything you need. one thing though is terribly missing: parallelism | ||
here is how i did it when i needed to copy 40 TB of data from one raidset to another while the server was still online serving files to everybody in the company: | here is how i did it when i needed to copy 40 TB of data from one raidset to another while the server was still online serving files to everybody in the company: | ||
+ | |||
+ | ==== testing ==== | ||
+ | to test this script when modifying, I use a simple test-dataset which I extract to ''/ | ||
+ | prsync.sh /tmp/source /tmp/target 3 1 / | ||
+ | to compare the resulting structure i use diff: | ||
+ | diff <(find source/|sed -e ' | ||
+ | and to delete the temporary files and target folder in order to re-run a fresh sync i run | ||
+ | rm -rf / | ||
+ | |||
+ | ===== Before we get started ===== | ||
+ | one important note right at the beginning: while parallelizing is certainly nice, we have to consider that spinning hard disks don't like concurrent file access. so be prepared to never ever see your hard disks' theoretical throughput reached if you copy lots of small files. | ||
+ | make sure you don't run too many parallel rsyncs by checking your cpu load with top. if you see the " | ||
+ | besides '' | ||
===== Step 1: create an incremental file list ===== | ===== Step 1: create an incremental file list ===== | ||
Line 24: | Line 232: | ||
after waiting too long for Option 1 to finish on a system that carried tons of backups of other systems, i tried this option: \\ | after waiting too long for Option 1 to finish on a system that carried tons of backups of other systems, i tried this option: \\ | ||
if you have tons of files and want to skip the lengthy process of producing a file list via rsync, you can create a list of directories using find and then simply run an rsync per directory. this will give you the full parallelism at the beginning but might end with a few everlasting rsyncs if you don't dig deep enough when doing your initial directory list. still, this might save a lot of time. | if you have tons of files and want to skip the lengthy process of producing a file list via rsync, you can create a list of directories using find and then simply run an rsync per directory. this will give you the full parallelism at the beginning but might end with a few everlasting rsyncs if you don't dig deep enough when doing your initial directory list. still, this might save a lot of time. | ||
- | find /source/./ -maxdepth | + | find /source/./ -maxdepth |
with the '' | with the '' | ||
- | there is no cleaning needed here, as we really want the directory names to sync directories | + | |
+ | now it's time to clean up the list. we need to move all lines that contain less than the '' | ||
+ | cp / | ||
+ | cp / | ||
+ | sed -i '/ | ||
+ | sed -i '/ | ||
+ | **make sure that the number in the sed regex is your '' | ||
+ | now we need to sync the parents without recursion first before continuing to step 2 | ||
+ | cat / | ||
+ | the trick here is to use the '' | ||
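Splitting the directory list into parents and leafs by depth can also be sketched with awk instead of sed. This is only an illustration under assumptions: `split_by_depth` is a hypothetical helper, and the list is assumed to hold one relative path per line starting with `./`, which may differ from the list format used above.

```shell
#!/bin/sh
# Sketch: split a directory list into "parents" (shallower than the maximum
# depth) and "leafs" (exactly at the maximum depth).
# split_by_depth is a hypothetical helper; paths containing newlines are
# not handled by this line-based approach.

split_by_depth() {
    local list=$1 depth=$2 parents=$3 leafs=$4
    # make sure both output files exist even if one of them stays empty
    : > "$parents"
    : > "$leafs"
    # with "/" as field separator "./a/b" has 3 fields, so depth = NF - 1
    awk -F/ -v d="$depth" -v p="$parents" -v l="$leafs" \
        '{ if (NF - 1 < d) print > p; else print > l }' "$list"
}
```

The depth is passed as a parameter here, so there is no number inside a regex that has to be kept in sync with the `-maxdepth` value by hand.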
===== Step 2: run Rsync with GNU parallel ===== | ===== Step 2: run Rsync with GNU parallel ===== | ||
now it's time to feed our filelist into rsync and run our parallel sync job. in order to parallelize rsync we use the GNU tool '' | now it's time to feed our filelist into rsync and run our parallel sync job. in order to parallelize rsync we use the GNU tool '' | ||
- | cat / | + | cat / |
- | note how, like in the above mentioned Option 2, we use the '/ | + | note how, like in the above mentioned Option 2, we use the '/ |
+ | note that parallel is thorough as far as escaping goes. there are no quotes | ||
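For systems without GNU parallel, a similar per-directory run can be sketched with `xargs -P`. Note that this swaps parallel for xargs and is not the command used in this step. `sync_leafs` and the `RSYNC` variable are hypothetical, the rsync flags are an assumption, and the list is assumed to hold one path per line relative to the source directory.

```shell
#!/bin/sh
# Sketch: one rsync per listed directory, several at once, using xargs -P
# instead of GNU parallel. sync_leafs and RSYNC are hypothetical names.

RSYNC=${RSYNC:-rsync}   # set RSYNC=echo to only print the commands

sync_leafs() {
    local list=$1 src=$2 dst=$3 jobs=$4
    # -R (--relative) together with the "/./" marker makes rsync recreate
    # only the path part after the marker below the destination
    tr '\n' '\0' < "$list" |
        xargs -0 -r -P "$jobs" -I{} "$RSYNC" -aHxR --numeric-ids "$src/./{}/" "$dst/"
}
```

A call such as `RSYNC=echo sync_leafs /tmp/leafs.list /source /target 4` prints the rsync commands instead of running them, which is a cheap way to check the generated paths before copying anything. The list file name in that call is made up for the example.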
===== Step 3: make sure we didn't miss anything ===== | ===== Step 3: make sure we didn't miss anything ===== | ||
probably the best feature about rsync is, that it resumes aborted previous jobs nicely and it can be run several times across the same source and target with no harm. so let's use this property to just fix everything we have missed or done wrong by simply running a single thread rsync in the end. now this can take some time, and I know no way around that. | probably the best feature about rsync is, that it resumes aborted previous jobs nicely and it can be run several times across the same source and target with no harm. so let's use this property to just fix everything we have missed or done wrong by simply running a single thread rsync in the end. now this can take some time, and I know no way around that. | ||
- | rsync -aHvx /source/ /target/ | + | rsync -aHvx --delete |