
Say I have a large directory tree of large files on disc A. I back up that tree with rsync -a --delete /A /B. So far so good. Between backups there are some added files, some renamed ones, the usual.

Where it gets interesting is that A gets regularly reorganized: files move around (renamed, moved to a different directory, or both). So rsync ends up deleting files on B only to copy them over again from A, and with large files over a network that takes forever.

Is there some rsync option I could use? I've re-read the option list and couldn't find anything relevant; something along the lines of --size-only would be fine with me (the risk of collisions is low).

I think the solution is probably to have a script look at file size + checksum and move files around on B before running rsync, but that's not too easy either. Ideas?
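
Roughly what I have in mind is the untested sketch below -- the 100M threshold and sha256 are arbitrary choices, and it assumes the mirror ends up at /B/A (which is what rsync -a --delete /A /B produces):

#!/bin/bash
# Untested sketch: index B's large files by content hash, then move them to
# match A's current layout before running rsync.
SRC=/A       # source tree
DST=/B/A     # where "rsync -a --delete /A /B" puts the mirror

declare -A by_hash   # hash -> current relative path on B

cd "$DST" || exit 1
while IFS= read -r -d '' f; do
    by_hash[$(sha256sum "$f" | cut -d' ' -f1)]=$f
done < <(find . -type f -size +100M -print0)

cd "$SRC" || exit 1
while IFS= read -r -d '' f; do
    old=${by_hash[$(sha256sum "$f" | cut -d' ' -f1)]:-}
    if [ -n "$old" ] && [ "$old" != "$f" ] && [ ! -e "$DST/$f" ]; then
        mkdir -p "$DST/$(dirname "$f")"
        mv "$DST/$old" "$DST/$f"     # relocate B's copy to where A has it now
    fi
done < <(find . -type f -size +100M -print0)

# then the usual backup:
rsync -a --delete /A /B

But that still hashes every large file on both sides, which is why I'm hoping there's a built-in rsync option instead.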

  • If this is intended to be a backup, have you considered using an actual backup tool? Tools like Restic or Borg do block-level deduplication and will handle this kind of reshuffling intelligently, only transferring data that has actually changed (and generally compressing better, which saves bandwidth as well). The downside, of course, is that you no longer have the same structure replicated remotely, but depending on your usage that may not matter at all. (A minimal restic sketch follows these comments.) Commented Dec 20, 2024 at 20:51
  • rsync doesn't make backups, it makes a single copy (which means you can't go back in time to a previous copy), and of course you can end up with an inconsistent state. It's good for keeping copies in sync, but not for backups. If you want actual backups, use a backup tool. Commented Dec 21, 2024 at 12:38
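
For reference, the kind of workflow the first comment is suggesting looks roughly like this (the repository path is made up; restic is shown, borg would be analogous):

# one-time repository setup (location is arbitrary; sftp/s3 backends also work)
restic init --repo /mnt/backup/repo

# each backup run: restic chunks file contents, so a large file that was merely
# renamed or moved deduplicates against chunks already stored in the repository
restic --repo /mnt/backup/repo backup /A

# snapshots are kept, so earlier states remain restorable
restic --repo /mnt/backup/repo snapshots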

2 Answers


There are two main avenues:

  1. the --fuzzy option (given twice), along with --compare-dest to name additional directories to search for basis files.
  2. a directory full of hard links to the large files, transferred first, which triggers rsync's hard-link detection (-H).

The former needs no changes to the file tree you are transferring, but it requires you to build a list of directories that files are typically copied from; I'm not sure how well the algorithm scales if that list of reference directories gets long.
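
A sketch of what that could look like, written as /A/ → /B/A/ (equivalent to the question's /A /B form); /B/A.old is a hypothetical directory on the B side holding the previous layout. One -y makes rsync look for a basis file in the same destination directory, a second -y extends the fuzzy scan into the --compare-dest directories, and --delete-after avoids deleting potential basis files before they can be used:

# -y once: a file that was only renamed within its directory can be found as a
# fuzzy basis file and used for the delta transfer instead of a full copy
rsync -a --delete-after -y /A/ /B/A/

# -y twice: also scan the corresponding directory inside each --compare-dest
# tree (/B/A.old is a hypothetical copy of the previous layout on the B side)
rsync -a --delete-after -y -y --compare-dest=/B/A.old /A/ /B/A/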

The more reliable approach requires a special directory that is transferred first, containing hard links to all the files that are worth optimizing. I'd do something like

find files -type f -size +2G -print0 |
  while IFS= read -r -d '' f; do ln -f "$f" "_links/$(sha256sum "$f" | cut -d' ' -f1)"; done

and then transfer _links before files, in a single run with -H -- that way, when a large file has merely been renamed or moved on A, its hash-named link is already present and unchanged on B, and rsync only has to create a new hard link instead of retransferring the data.
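
A sketch of the transfer itself, assuming _links and files both sit directly under /A as above; the essential parts are -H and keeping both directories in the same run ("_links" happens to sort before "files", so the hash-named entries go over first):

# -H makes rsync treat _links/<hash> and files/... as one hard-link group, so a
# file that was only renamed or moved on A is recreated on B as a hard link to
# the already-present hash entry instead of being sent again
rsync -aH --delete /A /B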

Neither of these options is particularly great.


When files are moved within the same filesystem, they normally retain the same inode number. You could therefore create a list of pairs of inode and filename on disc A just after a backup, then form a new list just before a new backup. By comparing the two lists, you can create a list of move commands by matching up inodes where the filenames differ. Apply this list of moves to disc B before doing the backup. You may need to create new directories first, of course. Here's an explanatory shell script:

#!/bin/bash
list(){
  find A -type f -printf "%i %P\n" | # inode and pathname not including A
  sort
}

list >after
#... a few days later, before backup:
list >before
join -j 1 -o 1.2,2.2 after before | # same inode: old filename, new filename
awk '$1!=$2 {printf "mv %s %s\n",$1,$2}' >cmds # emit a move for each renamed file
(cd A; find . -type d -print0) |
(cd B; xargs -0 mkdir -p )
(cd B; sh -x ) <cmds
# now do backup, and at end:
mv before after

This is obviously only for simple cases. It does not handle directory or file names containing spaces or special characters, and it leaves the old directory names around on B. Perhaps it could be applied only to nicely named huge files (find ... -size +10M ...).
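
Restricting the inventory to big files would just mean tightening the find in list(), along these lines (the 10M threshold is arbitrary):

list(){
  find A -type f -size +10M -printf "%i %P\n" | # only track large files
  sort
}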
