parallel_rsync

how many jobs should run in parallel and how many directories deep you want to parallelize your jobs really depends on your specific situation. if you have several terabytes of data and you do a complete sync, it makes sense to dive deeper into the structure than when you just want to update an already existing copy of the same data; in that case it might be faster to only dive 1 to 2 levels deep into your structure, or even not use this script at all when most of the time is spent "creating incremental file list". really, read what's behind the script further down to understand how to parametrize it and how to modify it to adjust it to your specific situation
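as a rough example (the paths here are just placeholders, the argument order is the one from the usage section below: source, destination, depth, parallel jobs):
<code bash>
# initial full copy of a multi-terabyte tree: split the work 5 levels deep, 8 parallel rsyncs
psync sourceHost:/data/projects /mnt/new/projects 5 8
# refreshing an already existing copy: a shallow split is usually faster
psync sourceHost:/data/projects /mnt/new/projects 2 4
</code>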
  
in the second version I have added the possibility to optionally pass a 5th and 6th argument. A filename can be passed as $5. If the file does not exist, the initial directory list which resulted from the ''find'' call at the beginning of the script is saved to it. If the file exists, prsync will read the contents of the file and use it as directory list instead of re-running the whole ''find'' operation. 

a second filename can be passed optionally as $6. prsync will save its progress to that file. if prsync is re-run, this file will be checked before the start of each rsync process. in case the directory that was supposed to be rsynced is already on the list, it will be skipped. this can prevent re-running rsync for a large number of already synced directories and speeds up resuming after an interrupted previous prsync run. 

these two optional arguments should only be used if the source does not change between prsync runs. They are especially beneficial if the source storage is unstable and may crash after a certain period of time. using these two files helps to avoid unnecessary file scanning and comparing when resuming the prsync operation after a crash and hence advances the progress faster by minimizing unnecessary load on the storage. 
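a quick sketch of how this looks in practice (the paths and file names here are arbitrary examples):
<code bash>
# first run: saves the directory list to /tmp/dirlist and logs every finished directory to /tmp/progress
psync storage01:/export/data /mnt/archive/data 4 8 /tmp/dirlist /tmp/progress
# re-run after an interruption: reuses /tmp/dirlist (no new find run) and skips everything listed in /tmp/progress
psync storage01:/export/data /mnt/archive/data 4 8 /tmp/dirlist /tmp/progress
</code>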
==== the code ====

<code bash prsync.sh>
#
# Parallel Rsync function 2020 by Pascal Suter @ DALCO AG, Switzerland
# documentation and explanation at http://wiki.psuter.ch/doku.php?id=parallel_rsync
#
# version 1: initial release in 2017
# version 2: May 2020, removed the need to escape filenames by using
#            null delimiter + xargs to run commands such as mkdir and rsync,
#            added ability to resume without rescanning (argument $5) and to skip
#            already synced directories (argument $6)
#

    # ...
    # $2 = destination
    # $3 = dirdepth
    # $4 = numjobs
    # $5 = dirlist file (optional) --> allows resuming without re-scanning the entire directory structure
    # $6 = progress log file (optional) --> allows skipping previously synced directories when resuming with a dirlist file
    source=$1
    destination=$2
    depth=$3
    threads=$4
    dirlistfile=$5
    progressfile=$6

    # gets directory listing from remote or local using ssh and find
    dirlist(){
        #$1 = path, $2 = maxdepth
        # ...
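        # checksum every file below $path: -print0/-0 keep unusual file names intact, -P $threads runs md5sum in parallel, sort -k 2 orders by file name so the lists can be compared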
        cd $path; find -type f -print0 | xargs -0 -P $threads -n 1 md5sum | sort -k 2
        fi
    }

    # generate a list of directories to sync
    if [ -z "$dirlistfile" ]; then
        rawfilelist=$(dirlist $source $depth)
    else
        # a dirlist filename was passed: if it exists, load the dirlist from it, otherwise create it and save the dirlist to the file
        if [ -f $dirlistfile ]; then
            rawfilelist=$(<$dirlistfile)
        else
            rawfilelist=$(dirlist $source $depth | tee $dirlistfile)
        fi
    fi

    # separate paths less than DIRDEPTH deep from the others, so that only the "leafs" get rsynced recursively, the rest is synced without recursion
    # ...
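        # create the directory: for a remote path (host:dir) split off the host and run mkdir -p via ssh, otherwise run it locally; piping the path null-delimited into xargs avoids any filename escaping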
        remote=`echo "$path" | awk -F : '{print $1}'`
        remotepath=${path:$((${#remote}+1))}
        echo -n -e "$remotepath\0" | ssh $remote "xargs -0 mkdir -p"
    else
        echo -n -e "$path\0" | xargs -0 mkdir -p
    fi

    # ...
  echo "Sync parents"  echo "Sync parents"
  echo "==========================================================================="  echo "==========================================================================="
- echo "$parentlist| source_escape | xargs -P $threads -I PPP rsync -aHvx --numeric-ids --relative -f '- PPP/*/' $source/./'PPP'/ $destination/ 2>/tmp/debug+ function PRS_syncParents(){ 
 + source=$2 
 + destination=$3 
 + progressfile=$4 
 + if [ -n "$progressfile" ] && grep -q -x -F "$1" $progressfile ; then 
 + echo "skipping $1 because it was synced before according to $progressfile" 
 + else 
 + echo -n -e "$1\0" | xargs --I PPP rsync -aHvx --numeric-ids --relative -f '- PPP/*/' $source/./'PPP'/ $destination/ 2>/tmp/debug 
 + status=$? 
 + if [ -n "$progressfile" ]; then  
 + echo "$1" >> "$progressfile" 
 + fi 
 + return $status 
 + fi 
 +
 + export -f PRS_syncParents 
 + echo "$parentlist" | tr \\n \\0 | xargs -0 -P $threads -I PPP /bin/bash -c 'PRS_syncParents "$@"' _ PPP "$source" "$destination" "$progressfile"
  status=$?  status=$?
  if [ $status -gt 0 ]; then   if [ $status -gt 0 ]; then 
        # ...
        return 1
    fi

    #sync leafs recursively
    echo "==========================================================================="
    echo "Sync leafs recursively"
    echo "==========================================================================="
- echo "$filelist| source_escape | xargs -P $threads -I PPP rsync -aHvx --relative --numeric-ids $source/./'PPP' $destination/ 2>/tmp/debug+ function PRS_syncLeafs(){ 
 + source=$2 
 + destination=$3 
 + progressfile=$4 
 + if [ -n "$progressfile" ] && grep -q -x -F "$1" $progressfile ; then 
 + echo "skipping $1 because it was synced before according to $progressfile" 
 + else 
 + echo -n -e "$1\0" | xargs --I PPP rsync -aHvx --relative --numeric-ids $source/./'PPP' $destination/ 2>/tmp/debug 
 + status=$? 
 + if [ -n "$progressfile" ]; then  
 + echo "$1" >> "$progressfile" 
 + fi 
 + return $status 
 + fi 
 +
 + export -f PRS_syncLeafs 
 + echo "$filelist" | tr \\n \\0 | xargs -0 -P $threads -I PPP /bin/bash -c 'PRS_syncLeafs "$@"' _ PPP "$source" "$destination" "$progressfile" 
 + status=$?
  if [ $? -gt 0 ]; then   if [ $? -gt 0 ]; then 
  cat /tmp/debug  cat /tmp/debug
  rm /tmp/debug  rm /tmp/debug
- echo "ERROR: there was an error when syncing the leaf directories recursively, check messages and try again"+ echo "ERROR: there was an error while syncing the leaf directories recursively, check messages and try again"
  return 1  return 1
  fi  fi
 +    #exit # uncomment for debugging what happenes before the final rsync
  
    #run a single thread rsync across the entire project directory
    # ...
        return 1
    fi

    exit # comment out if you want to really do the md5 sums, this may take very long!

    #create an md5 sum of the md5sums of all files of the entire project directory to compare it to the archive copy
    # ...
</code>
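the remainder of the function (elided above) checksums both sides to verify the copy, as the comment before it describes. a rough standalone sketch of the same idea (not part of prsync.sh, using the ''/tmp/source'' and ''/tmp/target'' test paths from the testing section below):
<code bash>
# build a sorted "md5  filename" list on both sides and show any differences
diff <(cd /tmp/source && find -type f -print0 | xargs -0 md5sum | sort -k 2) \
     <(cd /tmp/target && find -type f -print0 | xargs -0 md5sum | sort -k 2)
</code>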
**Usage** 
you can run this function like so: 
  source prsync.sh
  psync sourceHost:/source/directory /target/destination 5 8 /tmp/dirlist /tmp/progressfile
this will copy /source/directory to /target/destination. it will dive 5 directory levels deep to parallelize rsyncs and run 8 rsync processes in parallel. with the optional ''dirlist'' and ''progressfile'' files it will track its progress and skip all directories it has already rsynced when it is re-run after an interrupted previous run.

**caution** this is a work in progress.. I am writing down my notes as I go! 
here is how I did it when I needed to copy 40 TB of data from one raidset to another while the server was still online, serving files to everybody in the company: 
  
==== testing ====
to test this script while modifying it, I use a simple test dataset which I extract to ''/tmp/''. I then uncomment the "exit" statement before the "final sync to doublecheck" and run the script like so: 
  prsync.sh /tmp/source /tmp/target 3 1 /tmp/testdirlist /tmp/progressfile
to compare the resulting structure I use diff: 
  diff <(find source/|sed -e 's/source//' | sort ) <(find target/ | sed -e 's/target//' | sort)
and to delete the temporary files and the target folder in order to re-run a fresh sync I run 
  rm -rf /tmp/target/* /tmp/testdirlist /tmp/progressfile

===== Doing it manually =====
Initially I did this manually to copy data from an old storage to a new one. When I later had to write a script to archive large directories with lots of small files, I decided to write the above function. So for those who are interested in reading more about the basic method and don't like my bash script, here is the manual way this all originated from :) 

==== Before we get started ====
one important note right at the beginning: while parallelizing is certainly nice, we have to consider that spinning hard disks don't like concurrent file access. so be prepared to never see your hard disks' theoretical throughput reached if you copy lots of small files.
make sure you don't run too many parallel rsyncs by checking your cpu load with top. if you see the "wa" (waiting) load increase, it means you have too many processes. on the system I did this all for, I first tried with 80 parallel rsyncs using option 2 below and had a waiting load of about 50% and a throughput of about 20MB/s. I then reduced to 15 parallel rsyncs and the waiting load went down to 25% while the bandwidth went up to over 100MB/s. that is on a raid set that achieves a raw throughput of over 500MB/s when streaming performance is measured. just to give you an idea. 
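to keep an eye on that waiting load while the sync is running, any of the standard tools will do, for example:
<code bash>
# the "wa" column in the cpu section shows the iowait percentage, refreshed every 5 seconds
vmstat 5
# per-device utilisation (from the sysstat package)
iostat -x 5
</code>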