parallel_rsync

Differences

This shows you the differences between two versions of the page.

parallel_rsync [08.08.2016 21:21] Pascal Suter
parallel_rsync [12.09.2016 14:09] Pascal Suter
Line 8: Line 8:
  
 Here is how I did it when I needed to copy 40 TB of data from one RAID set to another while the server was still online, serving files to everybody in the company:
 +
 +===== Before we get started =====
 +One important note right at the beginning: while parallelizing is certainly nice, we have to consider that spinning hard disks don't like concurrent file access. So be prepared to never see your hard disks' theoretical throughput reached if you copy lots of small files.
 +Make sure you don't run too many parallel rsyncs by checking your CPU load with ''top''. If you see the "wa" (I/O wait) value increase, you have too many processes. On the system I did all this for, I first tried 80 parallel rsyncs using option 2 below and saw an I/O wait of about 50% and a throughput of about 20MB/s. I then reduced to 15 parallel rsyncs; the I/O wait went down to 25% and the throughput went up to over 100MB/s. That is on a RAID set that achieves a raw throughput of over 500MB/s when streaming performance is measured, just to give you an idea.
 +Besides ''top'' you can also use ''iotop'' to monitor your overall rsync speed.
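If you would rather log the I/O wait than watch ''top'', the same figure can be derived from the kernel counters on Linux. The following is only a minimal sketch: the function names ''iowait_percent'' and ''sample_iowait'' are made up for this example, and it assumes the usual ''/proc/stat'' layout in which the fifth number on the ''cpu'' line is the iowait counter.

```shell
#!/bin/sh
# Sketch (assumption: Linux /proc/stat layout
#   "cpu user nice system idle iowait irq softirq steal ...").
# Prints the iowait share, in percent, between two "cpu ..." snapshot lines.
iowait_percent() {
    # $1 = earlier snapshot line, $2 = later snapshot line
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            for (i = 2; i <= 9; i++) total += $i   # user .. steal
            t[NR] = total
            w[NR] = $6                             # iowait counter
        }
        END {
            dt = t[2] - t[1]
            if (dt <= 0) { print 0; exit }
            printf "%.0f\n", 100 * (w[2] - w[1]) / dt
        }'
}

# Hypothetical helper: take two live samples N seconds apart (default 5).
sample_iowait() {
    a=$(grep '^cpu ' /proc/stat)
    sleep "${1:-5}"
    b=$(grep '^cpu ' /proc/stat)
    echo "iowait: $(iowait_percent "$a" "$b")%"
}
```

Calling ''sample_iowait 10'' while the copy is running gives one reading over ten seconds. If the value keeps climbing as you add rsync processes, that is the same signal as a rising "wa" in ''top''.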
  
 ===== Step 1: create an incremental file list =====
Line 24: Line 29:
 After waiting too long for Option 1 to finish on a system that carried tons of backups of other systems, I tried this option: \\
 If you have tons of files and want to skip the lengthy process of producing a file list via rsync, you can create a list of directories using find and then simply run one rsync per directory. This gives you full parallelism at the beginning but might end with a few everlasting rsyncs if you don't dig deep enough when building your initial directory list. Still, this might save a lot of time.
-  find /source/./ -maxdepth 5 -type d | perl -pe 's|^.*?/\./|\1|' > /tmp/filelist
+  find /source/./ -maxdepth 5 -type d | perl -pe 's|^.*?/\./|\1|' > /tmp/rawfilelist
 With the ''-maxdepth'' option you can set how deep you want to dive into your directory tree. The goal is to get directories with a rather small number of files so you don't have to wait too long for the last couple of rsyncs to finish. Also note the added ''/./'' at the end of the source path. That's important, as it defines the point from which rsync's paths should be relative. Also check out the rsync man page, I stole the idea from there ;)
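To make the role of the ''/./'' marker more concrete, here is a small sketch that strips the prefix from find-style output and only prints the rsync commands one might run per directory. It is not the procedure from this page: ''relative_dirs'', ''print_rsync_cmds'' and the ''SRC''/''DST'' variables are names invented for the example, ''sed'' stands in for the perl one-liner, and the printed commands are a preview that is not safe to execute for paths containing spaces. Note also that directories nested inside each other would be visited more than once by a recursive rsync, so a raw list like this still needs filtering.

```shell
#!/bin/sh
# Sketch: turn "/source/./a/b" into "a/b" and drop the empty line
# that the source root itself ("/source/./") produces.
relative_dirs() {
    sed -e 's|^.*/\./||' -e '/^$/d'
}

# Print (do not run) one rsync command per relative directory on stdin.
# SRC and DST are hypothetical defaults for illustration only.
print_rsync_cmds() {
    src=${SRC:-/source}
    dst=${DST:-/target}
    while IFS= read -r dir; do
        # With --relative, rsync recreates the path after "/./" below the target.
        printf 'rsync -aHx --relative %s/./%s/ %s/\n' "$src" "$dir" "$dst"
    done
}
```

For example, ''find /source/./ -maxdepth 5 -type d | relative_dirs | print_rsync_cmds'' lists the commands so you can check the paths before starting anything in parallel.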
  
Line 45: Line 50:
 ===== Step 3: make sure we didn't miss anything =====
 Probably the best feature of rsync is that it resumes previously aborted jobs nicely and can be run several times across the same source and target with no harm. So let's use this property to fix everything we have missed or done wrong by simply running a single-threaded rsync at the end. This can take some time, and I know no way around that.
-  rsync -aHvx /source/ /target/
+  rsync -aHvx --delete /source/ /target/
  
  
  • parallel_rsync.txt
  • Last modified: 20.05.2020 19:44
  • by Pascal Suter