Differences

This shows you the differences between two versions of the page.

--- parallel_rsync [12.09.2016 14:09] – Pascal Suter
+++ parallel_rsync [20.02.2017 17:40] – Pascal Suter
@@ Line 1: / Line 1: @@
-====== Parallel Rsync (how I believe it's done) ======
+====== Parallel Rsync (my way) ======
+rsync is sooo cool: chances are, if you need to copy some files for whatever reason from one Linux machine to another, or even from one directory to another, rsync has everything you need. One thing though is terribly missing: parallelism.
+When you copy files with rsync you often see I/O throughput (in iotop, for example) that is far below what your disks or your network connection are capable of.
+  - When you copy a directory containing many small files locally, rsync is slowed down by all the metadata operations it does (copying all the permissions, checking each file for changes by comparing file dates, etc.).
+  - When you copy files across a network, you are slowed down by a single-threaded ssh process, which can only use one CPU core for encrypting and decrypting the data on that connection.
+My solution to that: run multiple rsync processes in parallel and so put several CPU cores to work at once.
+Here is a bash function I wrote for use in scripts that copy files from A to B through multiple connections.
+**Use this at your own risk.** If you want to understand what it does (and I strongly suggest you do before using it blindly!), you can read through my ideas on how to ideally parallelize rsync below the script.
+===== the bash function  =====
+What this function does is as follows: it runs a find across the source directory and gets a list of all directories within it, down to a maximum directory depth of $3 (the third argument to the function). It then queues these directories and uses xargs to run $4 (the fourth argument) rsync processes in parallel to sync them.
+Once all directories have been synced, it runs one more plain, single-threaded rsync across the whole tree (with --delete, so files that no longer exist in the source are removed from the target). By then all files should already have been copied, so this step is mostly a safety measure to make sure that really everything was copied and that all file attributes are correct.
+Once this step has passed, it runs another sanity check and compares the md5 sums of all source and target files. This might take very long and is not really necessary, but I wrote this function for an archive script that copies files to an archive before they are deleted from the source, so I wanted to be 100% sure everything went okay :)
+How many jobs should run in parallel and how many directory levels deep you want to parallelize your jobs really depends on your specific situation. If you have several terabytes of data and you do a complete sync, it makes sense to dive deeper into the structure than when you just want to update an already existing copy of the same data. In that case it might be faster to dive only 1 to 2 levels deep into your structure, or even not to use this script at all when most of the time is spent on "creating incremental file list". Really, read what's behind the script further down to understand how to parametrize it and how to adapt it to your specific situation.
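The split into parents and leafs described above can be tried out in isolation. The following is only a sketch, not the code psync uses: the helper names list_dirs, parents and leafs are made up for this example, it assumes GNU find (for -printf), and it counts path components with awk instead of using the sed expressions in the function. Directory names containing newlines would break it, and the source root itself is left out of the list.

```bash
# list directories below root $1, at most $2 levels deep, relative to the root
# (GNU find: %P prints the path with the starting point removed)
list_dirs() {
	find "$1" -mindepth 1 -maxdepth "$2" -type d -printf '%P\n' | LC_ALL=C sort
}

# read paths on stdin, keep those shallower than depth $1
# (these would be synced without recursion)
parents() {
	awk -F/ -v d="$1" 'NF < d'
}

# read paths on stdin, keep those exactly $1 levels deep
# (these would be synced recursively)
leafs() {
	awk -F/ -v d="$1" 'NF == d'
}
```

For a tree that contains the directories a/b/c and d, list_dirs with a depth of 2 piped into parents 2 prints a and d, while piping it into leafs 2 prints a/b. Everything below a/b would then be covered by the recursive rsync of that leaf.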
+==== known issues ====
+  * When I wrote the script I used excessive escaping to make sure that more complex file names could also be copied. However, I do not escape the source directory name or the target directory name. Sadly, I realized only after writing all this escaping madness that I could have used the -0 option of xargs, together with the corresponding null-delimiter options of the other tools, and so avoided the need for all this escaping in the first place.
+<code>
+#
+# Parallel Rsync function 2017 by Pascal Suter @ DALCO AG, Switzerland
+#
+psync() {
+	# $1 = source
+	# $2 = destination
+	# $3 = dirdepth
+	# $4 = numjobs
+	source=$1
+	destination=$2
+	depth=$3
+	threads=$4
+	# gets a directory listing from a remote (via ssh) or local path using find
+	dirlist(){
+		#$1 = path, $2 = maxdepth
+		path=$1
+		echo "$path" | grep -P "^[^@]*@[^:]*:" > /dev/null
+		if [ $? -eq 0 ]; then
+			remote=`echo "$path" | awk -F : '{print $1}'`
+			remotepath=${path:$((${#remote}+1))}
+			ssh $remote "find $remotepath/./ -maxdepth $2 -type d | perl -pe 's|^.*?/\./||'"
+		else
+			find $1/./ -maxdepth $2 -type d | perl -pe 's|^.*?/\./||'
+		fi
+	}
+	# get a sorted list of md5sums of all files in a directory (remote via ssh or local)
+	md5list(){
+		#$1 = path
+		path=$1
+		echo "$path" | grep -P "^[^@]*@[^:]*:" > /dev/null
+		if [ $? -eq 0 ]; then
+			remote=`echo "$path" | awk -F : '{print $1}'`
+			remotepath=${path:$((${#remote}+1))}
+			ssh $remote "cd $remotepath; find -type f -print0 | xargs -0 -P $threads -n 1 md5sum | sort -k 2"
+		else
+			cd $path; find -type f -print0 | xargs -0 -P $threads -n 1 md5sum | sort -k 2
+		fi
+	}
+	# escape wrapper function. will do a double escape if the source is remote, will do a single escape if source is local
+	source_escape() {
+		echo "$source" | grep -P "^[^@]*@[^:]*:" > /dev/null
+		if [ $? -eq 0 ];then
+			escape | escape
+		else
+			escape
+		fi
+	}
+	#magic escape function. it is probably not yet complete but it can be expanded based on the output of the last "final sync to double check"
+	#file names that were not escaped or were wrongly escaped end up there.
+	escape() {
+		sed -e 's/\\/\\\\/g' -e 's/ /\\ /g' -e 's/\$/\\\$/g' -e 's/:/\\:/g' -e 's/(/\\(/g' -e 's/)/\\)/g' -e 's/"/\\"/g' -e "s/'/\\\\'/g" -e 's/|/\\|/g'
+	}
+	# generate a list of directories to sync
+	rawfilelist=`dirlist $source $depth`
+	# separate paths less than DIRDEPTH deep from the others, so that only the "leafs" get rsynced recursively, the rest is synced without recursion
+	i=$(($depth - 1))
+	parentlist=`echo "$rawfilelist" | sed -e '/^\(.*\/\)\{'$i'\}.*$/d'`
+	filelist=`echo "$rawfilelist" | sed -e '/^\(.*\/\)\{'$i'\}.*$/!d'`
+	# create target directory:
+	path=$destination
+	echo "$path" | grep -P "^[^@]*@[^:]*:" > /dev/null
+	if [ $? -eq 0 ]; then
+		remote=`echo "$path" | awk -F : '{print $1}'`
+		remotepath=${path:$((${#remote}+1))}
+		remotepath=`echo "$remotepath" | escape | escape`
+		ssh $remote "mkdir -p $remotepath"
+	else
+		path=`echo "$path" | escape`
+		mkdir -p $path
+	fi
+	#sync parents first
+	echo "==========================================================================="
+	echo "Sync parents"
+	echo "==========================================================================="
+	echo "$parentlist" | source_escape | xargs -P $threads -I PPP rsync -aHvx --numeric-ids --relative -f '- PPP/*/' $source/./'PPP'/ $destination/ 2>/tmp/debug
+	status=$?
+	if [ $status -gt 0 ]; then
+		cat /tmp/debug
+		rm /tmp/debug
+		echo "ERROR ($status): there was an error when syncing the parent directories, check messages and try again"
+		return 1
+	fi
+	#sync leafs recursively
+	echo "==========================================================================="
+	echo "Sync leafs recursively"
+	echo "==========================================================================="
+	echo "$filelist" | source_escape | xargs -P $threads -I PPP rsync -aHvx --relative --numeric-ids $source/./'PPP' $destination/ 2>/tmp/debug
+	if [ $? -gt 0 ]; then
+		cat /tmp/debug
+		rm /tmp/debug
+		echo "ERROR: there was an error when syncing the leaf directories recursively, check messages and try again"
+		return 1
+	fi
+	#run a single thread rsync across the entire project directory
+	#to make sure nothing is left behind.
+	echo "==========================================================================="
+	echo "final sync to double check"
+	echo "==========================================================================="
+	rsync -aHvx --delete --numeric-ids $source/ $destination/
+	if [ $? -gt 0 ]; then
+		echo "ERROR: there was a problem during the final rsync, check message and try again"
+		return 1
+	fi
+	#compare the md5sums of all files in the entire project directory with those of the archive copy
+	echo "==========================================================================="
+	echo "sanity check"
+	echo "==========================================================================="
+	diff <( md5list $source ) <( md5list $destination )
+	if [ $? -gt 0 ]; then
+		echo "ERROR: the copy seems to be different from the source. check the list of files with different md5sums above. Maybe the files were modified during the copy process?"
+		return 1
+	fi
+	echo "SUCCESS: the entire directory $source has successfully been copied."
+}
+</code>
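The sanity check at the end of the function boils down to comparing two sorted checksum lists. Here is a reduced, local-only sketch of that idea. The names md5list_local and same_content are made up for this example, there is no ssh handling, and it assumes md5sum from coreutils and an xargs that accepts -0 and -r. Unlike the diff in the function above, it only reports whether the trees match, not which files differ.

```bash
# print "checksum  ./relative/path" for every file below directory $1,
# sorted by path so that two listings can be compared line by line
md5list_local() {
	( cd "$1" && find . -type f -print0 | xargs -0 -r md5sum | LC_ALL=C sort -k 2 )
}

# return 0 if both directories hold the same files with the same checksums,
# 1 if they differ, 2 if one of the listings could not be created
same_content() {
	local a b
	a=$(md5list_local "$1") || return 2
	b=$(md5list_local "$2") || return 2
	[ "$a" = "$b" ]
}
```

Because the paths are printed relative to each directory, the two listings are identical whenever the contents match, regardless of where source and copy live.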
+**Usage**
+You can run this function like so:
+  psync user@sourceHost:/source/directory target/destination 5 8
+This copies /source/directory from sourceHost to target/destination and dives 5 directory levels deep to parallelize the rsyncs. It runs 8 rsync processes in parallel. Note that the function only treats a path as remote if it has the form user@host:/path.
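For comparison, the null-delimited approach mentioned under known issues might look roughly like the sketch below. It is not a replacement for psync: the name psync0 is made up, it handles local paths only, it parallelizes over the top-level directories only (no depth argument, no parent/leaf split), and it assumes GNU find for -printf. The point is just that with null-terminated names, nothing has to be escaped.

```bash
# psync0 SRC DST [JOBS] : hypothetical, local-only sketch
psync0() {
	local src=$1 dst=$2 jobs=${3:-4}
	mkdir -p "$dst" || return 1
	# %P prints each top-level directory relative to $src,
	# \0 terminates every name with a null byte instead of a newline
	find "$src" -mindepth 1 -maxdepth 1 -type d -printf '%P\0' |
		xargs -0 -P "$jobs" -I{} rsync -aH --numeric-ids --relative "$src/./{}" "$dst/" || return 1
	# single final pass: copies files sitting directly in $src
	# and anything the parallel runs may have missed
	rsync -aH --numeric-ids "$src/" "$dst/"
}
```

Called as psync0 /data/src /data/dst 8 (both paths are placeholders), it would run up to 8 rsync processes at once, one per top-level directory, and names with spaces or quotes pass through xargs unchanged.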
 **caution** this is a work in progress.. I am writing down my notes as I go!
 **caution** please be careful with the instructions below and think it through yourself. I will take no responsibility for any data loss as a result of this article.
-rsync is sooo cool, chances are, if you need to copy some files for whatever reason from one linux machine to another or even from one directory to another, rsync has everything you need. one thing though is terribly missing: parallelism
 Here is how I did it when I needed to copy 40 TB of data from one raidset to another while the server was still online, serving files to everybody in the company: