
I have two directories, personal_files and personal_files_oldcopy.

After some file processing, I am no longer sure whether both directories have the same structure, or whether one contains additional files or is missing some. Assume that I don't much care about the actual content of the files, only their presence, and thus any difference between the two file trees.

The directories are each approximately 2 TB in size, so an ordinary diff -r, which reads the contents of every file, is not viable.

How can I quickly, transparently and easily compare the two directories, including their structure and the presence of named files?

Ideally, I want to know which files and directories are included in one but not the other. Bonus points if, by some checksumming magic or file size comparison, I can also get a report of superficial differences between files that have the same location and name.

Example:

personal_files/
├── docs/
│   ├── resume.pdf
│   └── cover_letter.docx
├── music/
│   ├── rock/
│   │   └── song1.mp3
│   └── jazz/
│       └── smooth.mp3
└── notes.txt (Content: "I am a note")

personal_files_oldcopy/
├── docs/
│   ├── resume.pdf
│   ├── cover_letter.docx
│   └── old_portfolio.pdf
├── music/
│   ├── rock/
│   │   └── song1.mp3
│   └── pop/
│       └── hit_single.mp3
└── notes.txt (Content: "This is another different note")

Anyone wishing to recreate that directory structure to test with can do so by executing this script:

for dir in personal_files personal_files_oldcopy; do
    mkdir -p "$dir/docs"
    echo 'foo' > "$dir/docs/resume.pdf"
    echo 'foo' > "$dir/docs/cover_letter.docx"
    mkdir -p "$dir/music/rock"
    echo 'foo' > "$dir/music/rock/song1.mp3"
done

dir='personal_files'
mkdir -p "$dir/music/jazz"
echo 'foo' > "$dir/music/jazz/smooth.mp3"
echo 'I am a note' > "$dir/notes.txt"

dir='personal_files_oldcopy'
echo 'foo' > "$dir"/docs/old_portfolio.pdf
mkdir -p "$dir/music/pop"
echo 'foo' > "$dir/music/pop/hit_single.mp3"
echo 'This is another different note' > "$dir/notes.txt"

The output should be something like this (the presentation may obviously differ):

Only in personal_files_oldcopy/docs: old_portfolio.pdf

Only in personal_files/music/jazz: smooth.mp3
Only in personal_files_oldcopy/music/pop: hit_single.mp3

// optional:
Present in both, content possibly different: notes.txt  (insert optional difference detection, checksum or size?)

// could be hidden:
Present in both: docs/resume.pdf
Present in both: docs/cover_letter.docx
Present in both: music/rock/song1.mp3

Potentially this could fold down to directory level when a whole directory is missing, instead of listing all the files within it, but that would be heading into script territory, right?

  • Run find separately on both directories to get a list of everything you want to compare, then sort these outputs so that you can compare both lists with diff. Commented Aug 15 at 20:25
  • Try vimdiff <(du -h personal_files) <(du -h personal_files_oldcopy) Commented Aug 15 at 20:26
  • I'd appreciate it if both of you could expand your comments into full answers with an example workflow, as both go a bit beyond my expertise and ability. Thanks :) Commented Aug 15 at 20:37
  • Clarified and done in original post! Commented Aug 15 at 21:03
  • rsync also provides some amazing directory/file comparison capabilities. You can use the -n (--dry-run) option to get a detailed listing of the differences in files between source and destination, based on size and modification time, not contents (e.g. rsync -uavn <src> <dest>). Commented Aug 15 at 22:08

2 Answers


You can use diff -rq personal_files personal_files_oldcopy for this purpose:

  • diff -r reports files that are present in only one of the directory trees, as well as files that are present in both but have differing contents.

  • -q suppresses the full display of content differences, which is probably what you describe as not viable.

  • You could also add -s to include the identical files in the output.

If you do not want to actually compare the files present in both trees, you could use this command:

( cd <dir> && find . -type f -ls ) | cut -f 7- -w  | sort -b -k 5 > <dir>.list

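This creates a text file, <dir>.list, containing every file name along with its size and timestamp, sorted by the name field. Run it once for each directory, then use diff on the two list files to show the differences. Note that -w (split on whitespace) is an option of BSD cut, as shipped with FreeBSD and macOS; GNU cut does not accept it.

If you do not care about the timestamps and only want to keep the size and name, use this instead: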

( cd <dir> && find . -type f -ls ) | cut -f 7,11- -w | sort -b -k 2  > <dir>.list

If you don't even care about sizes, here is an even simpler (and faster) command:

( cd <dir> && find . -type f ) | sort > <dir>.list
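
To make the "then use diff" step concrete, here is a minimal sketch that builds both name-only lists and compares them. The helper names make_list and list_diff are made up for illustration, and the lists go into a temporary directory instead of <dir>.list so that nothing is left behind. It only walks the directory trees and never opens a file, and it assumes file names contain no newlines.

```shell
#!/bin/sh
# Sketch only: make_list and list_diff are hypothetical helper names.

# Print the relative path of every regular file under $1, sorted bytewise.
make_list() {
    ( cd "$1" && find . -type f ) | LC_ALL=C sort
}

# list_diff DIR_A DIR_B: compare the two listings with diff.
# Exit status follows diff: 0 = same set of files, 1 = differences found.
# The body runs in a subshell, so its variables do not leak to the caller.
list_diff() (
    tmp=$(mktemp -d) || exit 2
    make_list "$1" > "$tmp/a.list"
    make_list "$2" > "$tmp/b.list"
    status=0
    diff "$tmp/a.list" "$tmp/b.list" || status=$?
    rm -rf "$tmp"
    exit "$status"
)
```

Calling list_diff personal_files personal_files_oldcopy prints lines starting with < for paths found only in the first tree and > for paths found only in the second.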

3 Comments

Sorry, I had in fact tried that already. The "not viable" part is that it actually seems to read through all the files, and quite slowly, which at 2 TB takes forever. I am asking for a surface-level diff rather than a full comparison.
@rappluk: so how do you want files present in both trees handled: listed as different if their lengths differ, as possibly different if only their timestamps differ, and as probably identical otherwise?
To be honest, that is inconsequential to me; I am fairly confident that files, if present, didn't change between the two directories. Speed and reliability in detecting missing or extra files are what matter to me. If there is any way to superficially detect and list differences within the same file, I'm happy to take it, but I'm not looking for anything specific. I guess file size might be a good enough indicator, but again, I don't know whether it fits the speed requirement.

This will show you the files and directories that do not exist in both directories:

$ diff <(find personal_files -printf '%P\n' | sort) <(find personal_files_oldcopy -printf '%P\n' | sort)
3a4
> docs/old_portfolio.pdf
6,7c7,8
< music/jazz
< music/jazz/smooth.mp3
---
> music/pop
> music/pop/hit_single.mp3

If you only want the lists of files, not directories then just add -type f to the finds:

$ diff <(find personal_files -type f -printf '%P\n' | sort) <(find personal_files_oldcopy -type f -printf '%P\n' | sort)
1a2
> docs/old_portfolio.pdf
3c4
< music/jazz/smooth.mp3
---
> music/pop/hit_single.mp3

If you want a rough guess at whether or not the files that do exist in both directories are the same, you could always include their sizes in the diff:

$ diff <(find personal_files -type f -printf '%P %s\n' | sort) <(find personal_files_oldcopy -type f -printf '%P %s\n' | sort)
1a2
> docs/old_portfolio.pdf 4
3c4
< music/jazz/smooth.mp3 4
---
> music/pop/hit_single.mp3 4
5c6
< notes.txt 12
---
> notes.txt 31

That assumes your file names cannot contain newlines, as in the provided example.
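
If you would rather get a report closer to the "Only in ..." format from the question than raw diff output, the same listings can be fed to comm and join. The following is only a sketch under a few assumptions: GNU find (for -printf), no tabs or newlines in file names, and report_tree_diff and list_sizes are names I made up for illustration.

```shell
#!/bin/sh
# Sketch only. Assumes GNU find (for -printf) and paths without tabs or newlines.
# list_sizes and report_tree_diff are hypothetical helper names.

TAB=$(printf '\t')

# Print "relative/path<TAB>size" for each regular file under $1, sorted by path.
list_sizes() {
    find "$1" -type f -printf '%P\t%s\n' | LC_ALL=C sort -t "$TAB" -k 1,1
}

# report_tree_diff DIR_A DIR_B
# The body runs in a subshell, so its variables do not leak to the caller.
report_tree_diff() (
    a=${1%/} b=${2%/}
    tmp=$(mktemp -d) || exit 2

    # Walk each tree once and keep the listings, so nothing is scanned twice.
    list_sizes "$a" > "$tmp/a.tsv"
    list_sizes "$b" > "$tmp/b.tsv"
    cut -f 1 "$tmp/a.tsv" > "$tmp/a.paths"
    cut -f 1 "$tmp/b.tsv" > "$tmp/b.paths"

    # comm -23: lines only in the first file; comm -13: only in the second.
    LC_ALL=C comm -23 "$tmp/a.paths" "$tmp/b.paths" |
        while IFS= read -r path; do printf 'Only in %s: %s\n' "$a" "$path"; done
    LC_ALL=C comm -13 "$tmp/a.paths" "$tmp/b.paths" |
        while IFS= read -r path; do printf 'Only in %s: %s\n' "$b" "$path"; done

    # join pairs up the sizes of paths that appear in both listings.
    LC_ALL=C join -t "$TAB" "$tmp/a.tsv" "$tmp/b.tsv" |
        while IFS="$TAB" read -r path size_a size_b; do
            [ "$size_a" = "$size_b" ] ||
                printf 'Size differs: %s (%s vs %s bytes)\n' "$path" "$size_a" "$size_b"
        done

    rm -rf "$tmp"
)
```

Run it as report_tree_diff personal_files personal_files_oldcopy. Files present in both trees with equal sizes are not listed at all, and only metadata is read, so the run time should depend on the number of files rather than on the 2 TB of data.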

5 Comments

A few files in the directories are only accessible by root, so how to include sudo here?
Probably just before the diff, if not then put it before each find instead. Should be easy for you to test.
Oh yeah, I tested that, of course. In front of the diff, it comes too late to let the finds read the protected subfolders. Inside the parentheses, I cannot enter a password, and the command exits with "sudo: unable to read password: Input/output error" and "sudo: a password is required".
If all else fails you can always do sudo find ... > tmp1; sudo find ... > tmp2; diff tmp1 tmp2. The important part is just to use find to find the files then diff to compare the output of the 2 find commands.
Actually, can't you wrap the whole thing like sudo { diff ... } or put it all in a subshell sudo ( diff ... )?
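
For reference, sudo runs a command rather than shell syntax, so a brace group or subshell cannot be handed to it directly. A minimal sketch of the temp-file approach suggested above could look like the following. It assumes GNU find (for -printf) and a working sudo setup; privileged_tree_diff and the AS_ROOT variable are hypothetical names introduced only for this example.

```shell
#!/bin/sh
# Sketch only. Assumes GNU find (for -printf) and a working sudo setup.
# privileged_tree_diff and AS_ROOT are hypothetical names.

# AS_ROOT is the command prefix used to gain privileges.
# Leave it unset to use sudo; set it to an empty string to run unprivileged.
privileged_tree_diff() (
    as_root=${AS_ROOT-sudo}
    tmp=$(mktemp -d) || exit 2

    # Each find is an ordinary foreground command, so sudo can prompt for
    # a password on the terminal. The redirections are performed by the
    # calling shell, so the list files stay owned by the invoking user.
    # $as_root is deliberately unquoted so that an empty value disappears.
    $as_root find "$1" -printf '%P\n' | LC_ALL=C sort > "$tmp/a.list"
    $as_root find "$2" -printf '%P\n' | LC_ALL=C sort > "$tmp/b.list"

    # diff exits with 1 when the lists differ; pass that on to the caller.
    status=0
    diff "$tmp/a.list" "$tmp/b.list" || status=$?
    rm -rf "$tmp"
    exit "$status"
)
```

Usage would be privileged_tree_diff personal_files personal_files_oldcopy. With a typical sudo configuration the second find reuses the cached credentials, though that depends on local sudoers settings.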
