parallel_rsync

Differences

This shows you the differences between two versions of the page.

parallel_rsync [20.05.2020 19:39] – [the bash function] Pascal Suter
parallel_rsync [20.05.2020 19:44] (current) – [Before we get startet] Pascal Suter
Line 33: Line 33:
 #
 # version 1: initial release in 2017
-# version 2: removed the need to escape filenames by using null delimiter + xargs to run commands such as mkdir and rsync,
-#            added ability to resume without rescanning (argument $5) and to skip already synced directories (argument $6)
+# version 2: May 2020, removed the need to escape filenames by using
+#            null delimiter + xargs to run commands such as mkdir and rsync,
+#            added ability to resume without rescanning (argument $5) and to skip
+#            already synced directories (argument $6)
 #
  
Line 43: Line 45:
  # $4 = numjobs
  # $5 = dirlist file (optional) --> will allow to resume without re-scanning the entire directory structure
-    # $6 = progress log file (optional) --> will allow to skip previously synced directory when resuming with a dirlist file
+        # $6 = progress log file (optional) --> will allow to skip previously synced directory when resuming with a dirlist file
  source=$1
  destination=$2
Line 212: Line 214:
   rm -rf /tmp/target/* /tmp/testdirlist /tmp/progressfile
  
-===== Before we get startet =====
+===== Doing it manually =====
+Initially I did this manually to copy data from an old storage system to a new one. When I later had to write a script to archive large directories with lots of small files, I decided to write the above function. So for those who are interested in reading more about the basic method and don't like my bash script, here is the manual way this all originated from :)
+
+==== Before we get startet ====
 One important note right at the beginning: while parallelizing is certainly nice, we have to consider that spinning hard disks don't like concurrent file access. So be prepared to never see your hard disks' theoretical throughput reached if you copy lots of small files.
 Make sure you don't run too many parallel rsyncs by checking your CPU load with top. If you see the "wa" (waiting) load increase, it means you have too many processes. On the system I did all this for, I first tried 80 parallel rsyncs using option 2 below and had a waiting load of about 50% and a throughput of about 20MB/s. I then reduced to 15 parallel rsyncs, and the waiting load went down to 25% and the bandwidth went up to over 100MB/s. That is on a RAID set that achieves a raw throughput of over 500MB/s when streaming performance is measured. Just to give you an idea.
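If you would rather get a number than watch top, the "wa" value can be approximated from the aggregate `cpu` line in /proc/stat. The following is only a rough sketch, assuming a Linux system; the function names `iowait_percent` and `sample_iowait` and the 5 second default interval are made up for this illustration and are not part of the script above.

```bash
# Sketch (Linux only): estimate the iowait ("wa") share of CPU time.
# The aggregate line in /proc/stat looks like:
#   cpu user nice system idle iowait irq softirq steal guest guest_nice

# iowait_percent "<first cpu line>" "<second cpu line>"
# Prints the integer percentage of CPU time spent in iowait between
# the two samples (0 if no time passed between them).
iowait_percent() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        # remember the counters of the first sample (awk fields 2..9)
        NR == 1 { for (i = 2; i <= 9; i++) first[i] = $i }
        NR == 2 {
            total = 0
            for (i = 2; i <= 9; i++) total += $i - first[i]
            waited = $6 - first[6]    # field 6 is iowait
            if (total <= 0) print 0
            else printf "%d\n", 100 * waited / total
        }'
}

# sample_iowait [seconds]
# Hypothetical helper: takes two samples and prints the percentage.
sample_iowait() {
    local interval=${1:-5}
    local first_sample second_sample
    first_sample=$(grep '^cpu ' /proc/stat)
    sleep "$interval"
    second_sample=$(grep '^cpu ' /proc/stat)
    iowait_percent "$first_sample" "$second_sample"
}
```

Calling `sample_iowait 10` while the copy is running prints the average over ten seconds. If the value climbs when you add more parallel rsyncs while the throughput drops, that is the same symptom as described above and a hint to reduce the number of jobs.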
  • parallel_rsync.1589996394.txt.gz
  • Last modified: 20.05.2020 19:39
  • by Pascal Suter