Create individual files via Bash with MD5 hashes of all files in a directory recursively

Question

I have an archive of photos stored in a directory tree on my Mac like:

./2016/05/17/photo-312.jpg
./2016/05/19/photo-1234.jpg
./2016/05/19/photo-5678.jpg

I want to create MD5 hashes of each file that can be used to verify the photos have not been altered or corrupted. My goals are:

One MD5 file per photo
Store the MD5 files in the same directory as their corresponding photos
Use the same base name as the photo, but switch the extension to .md5
Capture only the hash value (e.g. b1046abbe7bbf2a2473e9489599f38e0) without any trailing spaces or newlines

For example, the above directory structure would look like this after the process runs:

./2016/05/17/photo-312.jpg
./2016/05/17/photo-312.md5
./2016/05/19/photo-1234.jpg
./2016/05/19/photo-1234.md5
./2016/05/19/photo-5678.jpg
./2016/05/19/photo-5678.md5

(Note: I only need to run this process one time. The process I use to move photos into the archive will create the necessary MD5 files for new photos from this point forward.)

Here's the one-liner I came up with:

find . -type f -name "*.jpg" -exec bash -c 'printf "%s" $(md5 -q "$0") > "${0%.*}.md5"' {} \;

(Note: my machine has md5 instead of md5sum which I often see referenced. So, I'm using that.)

Here are a few details on how I understand this to work:

The first section runs a basic find command on the current directory (i.e. ".") looking for .jpg files and sends them to bash with -exec bash -c
```
find . -type f -name "*.jpg" -exec bash -c 
```
Bash runs printf to set up a string that doesn't end with a newline:
```
printf "%s"
```
This section generates the hash that is fed into printf as the string:
```
$(md5 -q "$0")
```
The -q flag tells md5 to output only the hash instead of the standard MD5 output, which would look something like:

MD5 (photo-312.jpg) = b1046abbe7bbf2a2473e9489599f38e0

The value of $0 is the relative path to the source .jpg file that find sent to bash.
This section creates the file path to store the value in where the original extension is replaced by .md5:
```
"${0%.*}.md5"
```
More details about what's going on there can be found in the ${parameter%word} section of the Bash Manual.
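As a small sketch of that expansion on its own (the paths are sample values, not taken from a real archive):

```shell
# Sample path, mirroring the layout in the question.
photo='./2016/05/17/photo-312.jpg'

# ${photo%.*} removes the shortest trailing match of ".*", i.e. the last extension.
md5_path="${photo%.*}.md5"
printf '%s\n' "$md5_path"        # prints ./2016/05/17/photo-312.md5

# Only the last extension is removed, so earlier dots in the name survive.
archive='./backup/photos.tar.gz'
printf '%s\n' "${archive%.*}"    # prints ./backup/photos.tar
```

Because find only passes names ending in .jpg here, there is always an extension to strip.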
The last little bit is:
```
{} \;
```
I'm not sure why, but the {} is necessary to make this run. (My understanding is that it's a reference to the file path. I don't know how that ties in, but md5: bash: No such file or directory errors pop up if it's not there.)

Finally, the \; identifies the end of find's -exec.

While I normally use other languages for this type of work, I decided to try this with bash to get some practice with it. I've done some basic testing and everything appears to work as expected. Given my infrequent use of bash, I'd like to make sure I'm not getting myself in trouble. So, my questions are:

Are there any gotchas in this code that are waiting to bite me?
Is there a more standard or efficient way to do this?

UPDATE: I modified my code based on the answers. In case it's useful, here's what I ended up with:

find . -type f \( -name '*.cr2' -or -name '*.jpg' \) -execdir sh -c 'sha1sum "{}" > "${1%.*}".sha1' -- {} \;

Which:

Allows for multiple file extensions to be processed at the same time.
Uses -execdir instead of -exec so the default output of the hashing tool doesn't contain directory paths (which is one reason I was trying to strip them originally).
Uses sha1sum instead of md5; it provides a -c flag for verifying files and didn't require installation via Homebrew.
Uses the more appropriate ${1%.*} (with the help of the -- at the end) instead of ${0%.*} to remove the initial file extension.

FYI: instead of md5 I moved to using sha1sum which seems to come installed by default on Macs running 10.12 and provides the -c for verification.
– Alan W. Smith, Commented May 20, 2017 at 16:43

janos · Accepted Answer · 2017-05-20 06:25:43Z

A gotcha, sort of...

Although the one-liner works, the use of $0 is inappropriate. From man bash:

   -c        If the -c option is present, then commands are read from  the
             first non-option argument command_string.  If there are argu-
             ments  after  the  command_string,  the  first  argument   is
             assigned  to  $0  and any remaining arguments are assigned to
             the positional parameters.  The assignment  to  $0  sets  the
             name  of  the  shell, which is used in warning and error mes-
             sages.

That is, file names are not appropriate values for $0, which is meant to hold the name of the shell. Positional arguments are in $1, $2, and so on, and those are the appropriate place for the file name. You can fix that by passing one extra argument right after the command string: as the excerpt above says, that first argument is assigned to $0, so the file name that follows it lands in $1. Here -- serves as that placeholder:

find ... -exec bash -c 'printf $(md5 -q "$1") > "${1%.*}.md5"' -- {} \;
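To see the mapping from the man page excerpt in isolation, here is a minimal sketch; `placeholder` and the file name are made-up values:

```shell
# With an extra argument after the command string, it becomes $0 and the file name becomes $1.
with_placeholder=$(sh -c 'printf "0=%s 1=%s" "$0" "$1"' placeholder 'my photo.jpg')
printf '%s\n' "$with_placeholder"      # prints: 0=placeholder 1=my photo.jpg

# Without it, the file name itself ends up in $0 and $1 is empty.
without_placeholder=$(sh -c 'printf "0=%s 1=%s" "$0" "$1"' 'my photo.jpg')
printf '%s\n' "$without_placeholder"   # prints: 0=my photo.jpg 1=
```

The quoted "$1" also keeps a name with spaces in one piece.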

Simplify

Is it really important to strip the .jpg at the end of filenames? Would it be terrible if the files looked like this?

./2016/05/17/photo-312.jpg
./2016/05/17/photo-312.jpg.md5
./2016/05/19/photo-1234.jpg
./2016/05/19/photo-1234.jpg.md5
./2016/05/19/photo-5678.jpg
./2016/05/19/photo-5678.jpg.md5

Because that would simplify the script a bit.

Is it really important to print only the MD5 digest without trailing newline? That would simplify the script a bit more. And that would get rid of the -q flag which is only supported by BSD's md5 tool, and not by GNU's md5sum. With this change, the script would be usable in Linux by defining md5 as an alias to md5sum.

Actually, it would be great to install GNU's md5sum (md5sha1sum package in Brew and MacPorts), because it has a -c flag to verify the file easily. For example, you would be able to verify the checksum of all files with:

find . -name '*.md5' -execdir md5sum -c {} \;

To create the files:

find . -type f -name "*.jpg" -execdir sh -c 'md5sum "{}" > "{}".md5' \;

Note that -execdir is necessary instead of -exec, so that the filenames don't have the directory part. Otherwise the .md5 files would contain the full path of files, and the md5sum -c verification would only work if invoked from the same relative path from where the digest file was created.
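One caveat with writing "{}" inside the sh -c string: find pastes the file name into the command text, so a name containing a double quote or $(...) would be interpreted by the shell. A possible variant passes the name as a positional parameter instead. This is only a sketch, assuming GNU md5sum is installed; the throwaway directory and sample file are made up for the demo:

```shell
# Throwaway directory with a made-up sample file (assumption for the demo).
demo=$(mktemp -d)
mkdir -p "$demo/2016/05/17"
printf 'not really a photo' > "$demo/2016/05/17/photo-312.jpg"

# Create one digest file per photo; the name reaches sh as $1, not as command text.
find "$demo" -type f -name '*.jpg' \
    -execdir sh -c 'md5sum "$1" > "$1.md5"' sh {} \;

# Verify each digest file from inside its own directory.
find "$demo" -type f -name '*.md5' -execdir md5sum -c {} \;
```

The extra `sh` after the command string fills $0, so the file name arrives in $1.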

Use `sh` if good enough

The Bash script in -exec doesn't do anything Bash specific, so you could replace bash with sh.

Redundant `"%s"` in `printf`

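Instead of printf "%s" something you could simply write printf something. That is safe here because an MD5 digest consists only of hexadecimal digits, so it contains no % or backslash that printf would interpret as part of a format string.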

Thanks for the notes! One reason I was stripping down to just the hash was because I was seeing paths in the normal MD5 generation. Thanks for the pointer to -execdir which takes care of that. -- There's no technical reason for me to remove the original extensions. Just aesthetic. Also, seemed like a good practice exercise for when it might matter in the future. -- I also ended up switching to sha1sum which seems to be installed by default and provides the -c flag for checking.
– Alan W. Smith, Commented May 20, 2017 at 16:51

chicks · Accepted Answer · 2017-05-20 02:03:42Z

gotchas

The main issue I see with your one-liner is: what happens if you end up with a file name that contains spaces? You have defined the input set so that it isn't a problem here, but including appropriate quoting or escaping to handle that would be a good idea in general. Typically you're taking the output of find and passing it along to something else, so you end up with a find -print0 | xargs -0 sort of arrangement. xargs wouldn't easily work for your example though.
Since you're not doing any interpolation in -name "*.jpg" it would be clearer to use single quotes which don't do any magic inside: -name '*.jpg'

efficiency

You are invoking bash once per file. This can be a significant overhead if you're processing thousands of files. It might be good to turn the -exec'd part of things into its own script that can handle multiple arguments. Since you're only doing this once I wouldn't worry about it.
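A rough way to see the difference is to count how many shells get started. With `\;` find runs one shell per file, while with `+` it hands a batch of names to a single shell. The directory and file names below are made up for the demo:

```shell
# Throwaway directory with three empty sample files (assumption for the demo).
demo=$(mktemp -d)
touch "$demo/a.jpg" "$demo/b.jpg" "$demo/c.jpg"

# One shell per file: each invocation sees exactly one argument.
per_file=$(find "$demo" -name '*.jpg' -exec sh -c 'echo "$# file(s)"' sh {} \;)

# One shell for the whole batch: a single invocation sees all three.
batched=$(find "$demo" -name '*.jpg' -exec sh -c 'echo "$# file(s)"' sh {} +)

printf '%s\n---\n%s\n' "$per_file" "$batched"
```

The first form prints three lines and the second prints one, so the batched shell would need a loop over "$@" to handle each file.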

suggestions

While one-liners are cool and all: why not turn this into a script? That would make it easier to add error handling for arguments and keep your notes in comments. While it would slow it down slightly you could also add some progress indicator like printing the current file that it is operating on.
You can get free software things on the Mac that you are missing like md5sum from homebrew. Or just run Linux in a VM.
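As a sketch of what such a script might look like: `hash_photos` is a made-up name, md5sum is assumed to be installed, and the progress output is just one possible choice.

```shell
# hash_photos is a hypothetical helper; it assumes md5sum is available.
hash_photos() {
    # Error handling for arguments: require exactly one existing directory.
    if [ "$#" -ne 1 ] || [ ! -d "$1" ]; then
        echo 'usage: hash_photos DIRECTORY' >&2
        return 2
    fi
    find "$1" -type f -name '*.jpg' -exec sh -c '
        for photo in "$@"; do
            echo "hashing $photo" >&2             # progress indicator
            hash=$(md5sum < "$photo") || exit 1   # stop if a file cannot be read
            # Keep only the digest and write it without a trailing newline.
            printf %s "${hash%% *}" > "${photo%.*}.md5"
        done
    ' sh {} +
}
```

It could then be called as `hash_photos .` from the top of the archive. On a Mac without md5sum, `md5 -q "$photo"` from the question would take the place of the md5sum line.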

Thanks for the review! -- I think it handles spaces in paths appropriately. I believe the quotes around $0 add that protection. (Just ran a couple tests that seemed to work fine, but I may be missing something) -- Good note on single quotes for the file extension. Making that change. -- And as to why I'm not turning this into a real script: mainly I just wanted to see if I could do this in bash and since I only need it one time, it seemed like a good exercise to get a little practice.
– Alan W. Smith, Commented May 20, 2017 at 15:47
