
I have a Perl script that is used to process data files from a given directory. I wrote the bash script below to look for the most recently updated file in that directory and process it.

cd $data_dir
find \( -type f -mtime -1 \) -exec ./script.pl {} \;

Sometimes users copy multiple files to the data directory, and then the earlier ones are skipped; the Perl script is executed only for the most recently updated file. Can you please suggest how to fix this in the bash script?

  • What OS are you using? Does find without -exec show all the files you want to process? Commented Nov 29, 2010 at 8:58
  • OS is Linux. Yes, it shows all the files which I want to process. Commented Nov 29, 2010 at 9:18

3 Answers


Try

cd $data_dir
find \( -type f -mtime -1 \) -exec ./script.pl {} +

Note that -exec is terminated with a + instead of your \;

From the man page

-exec command {} +
This variant of the -exec action runs the specified command on the selected files, but the command line is built by appending each selected file name at the end;

Now that you'll have one or more file names passed into your perl script, you can alter your perl script to iterate over each passed in file name.
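As a quick illustration of the difference (using a hypothetical scratch directory, with echo standing in for ./script.pl), -exec ... + builds one command line containing all the matches, while -exec ... \; spawns one process per file:

```shell
# Hypothetical demo directory; echo stands in for ./script.pl.
mkdir -p /tmp/find_plus_demo
cd /tmp/find_plus_demo
touch a.txt b.txt c.txt

# One invocation with all names appended -> a single output line:
find . -type f -exec echo {} +

# One invocation per file -> three output lines:
find . -type f -exec echo {} \;
```

Your Perl script would see the batched names in @ARGV, which is why it needs to loop over them rather than assume a single argument.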




If I understood the question correctly, you need to process any files that were created or modified in a directory since the last time your script was run.

In my opinion find is not the right tool to determine those files, because it has no notion of which files it has already seen.

Using any of the -atime/-ctime/-mtime options will either produce duplicates if you run your script twice in the specified period, or miss some files if it is not executed at the right time. The timing intricacies of using these options for something like this are not easy to deal with.

I can propose a few alternatives:

a) Use three directories instead of one: incoming/, processing/ and done/. Your users should only be allowed to put files in incoming/. Move any files in there to processing/ with a simple mv incoming/* processing/ before running your Perl script, then move them from processing/ to done/ when it's over.

In my opinion this is the simplest and best solution, and the one used by mail servers and other software that deals with this issue. If I were you, and there were no special circumstances preventing you from doing this, I'd stop reading here.
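A minimal sketch of the three-directory flow (the /tmp location and file names here are illustrative assumptions, and the script.pl call is stubbed out):

```shell
# Assumed spool layout under /tmp; in practice this would live in $data_dir.
mkdir -p /tmp/spool_demo/incoming /tmp/spool_demo/processing /tmp/spool_demo/done
cd /tmp/spool_demo

touch incoming/file1.dat incoming/file2.dat   # simulate user uploads

# Claim everything that has arrived so far; new uploads keep landing
# in incoming/ untouched while we work on this batch.
mv incoming/* processing/ 2>/dev/null

for f in processing/*; do
    [ -e "$f" ] || continue        # guard against an empty batch
    # ./script.pl "$f"             # real processing would happen here
    mv "$f" done/                  # mark as finished
done
```

Because the mv claims the whole batch atomically per file, nothing is processed twice and nothing is missed, regardless of when users copy files in.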

b) Have your finder script touch a special file (e.g. .timestamp, perhaps in a different directory, so that your users will not tamper with it) when it's done. This will allow your script to remember the last time it was run. Then use

find \( -cnewer .timestamp -o -newer .timestamp \) -type f -exec ./script.pl '{}' ';'

to run your perl script for each file. You should modify your perl script so that it can run repeatedly with a different file name each time. If you can modify it to accept multiple files in one go, you can also run it with

find \( -cnewer .timestamp -o -newer .timestamp \) -type f -exec ./script.pl '{}' +

which will minimise the number of ./script.pl processes. Take care to handle the first run of the find script, when the .timestamp file is missing. A good solution would be to simply ignore it by not using the -*newer options at all in that case. Also keep in mind that there is a race condition where files added after find was started but before touching the timestamp file will not be processed.
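A hedged sketch of the first-run handling described above (the /tmp path is an assumption, and .timestamp is pruned here rather than kept in a separate directory as the answer suggests):

```shell
mkdir -p /tmp/ts_demo
cd /tmp/ts_demo
touch old.dat

if [ -e .timestamp ]; then
    # Normal run: only files created or modified after the marker.
    find . -name .timestamp -prune -o \
         \( -cnewer .timestamp -o -newer .timestamp \) -type f -print
else
    # First run: no marker yet, so take every file.
    find . -name .timestamp -prune -o -type f -print
fi

touch .timestamp   # remember this run for next time
```

In a real script the -print would be replaced by the -exec ./script.pl invocation, and the marker would only be touched after processing succeeds.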

c) As a variation of (b), have your script update the timestamp with the time of the processed file that was created/modified most recently. This is tricky, because find cannot order its output on its own. You could use a wrapper around your perl script to handle this:

#!/bin/bash

# For each file name passed in, copy its timestamp onto .timestamp
# if it is newer than the current marker (touch -r copies timestamps).
for i in "$@"; do
    find "$i" \( -cnewer .timestamp -o -newer .timestamp \) -exec touch -r '{}' .timestamp ';'
done

./script.pl "$@"

This will update the timestamp if it is called to process a file with a newer mtime or ctime, minimising (but not eliminating) the race condition. It is, however, somewhat awkward; that seems unavoidable, since bash's [[ -nt ]] test appears to check only the mtime. It might be better if your Perl script handled that on its own.

d) Have your script store each processed filename and its timestamps somewhere and then skip duplicates. That would allow you to just pass all files in the directory to it and let it sort out the mess. Kinda tricky though...
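One way to sketch option (d) with a plain state file (the .processed name and the /tmp layout are purely hypothetical, script.pl is stubbed out, and this version tracks names only — storing timestamps as well, as suggested above, would let re-modified files be reprocessed):

```shell
mkdir -p /tmp/seen_demo
cd /tmp/seen_demo
touch a.dat b.dat
: > .processed          # state file: one processed path per line

find . -type f ! -name .processed -print | while read -r f; do
    grep -qxF "$f" .processed && continue   # already handled: skip
    # ./script.pl "$f"                      # real processing here
    echo "$f" >> .processed
done
```

Running the loop a second time adds nothing to .processed, which is exactly the duplicate-skipping behaviour described.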

e) Since you are using Linux, you might want to have a look at inotify and the inotify-tools package - specifically the inotifywait tool. With a bit of scripting it would allow you to process files as they are added to the directory:

inotifywait -e MOVED_TO -e CLOSE_WRITE -m -r testd/ | grep --line-buffered -e MOVED_TO -e CLOSE_WRITE | while read -r d e f; do ./script.pl "$d$f"; done

This has no race conditions, as long as your users do not create/copy/move any directories rather than just files.

2 Comments

"Using the -mtime option with a negative parameter will simply select all files." No, -mtime -1 selects files that were modified within the last 24 hours.
@Dennis: thanks for pointing that one out, I removed the sentence altogether.

The Perl script is only executed against the files that find gives it. Perhaps you should remove the -mtime -1 option from the find command so that it picks up all the files in the directory?

