I have a scenario where a crawler periodically drops web archive (WARC) files into different directories. Each WARC file internally contains thousands of HTML files.
Now, I need to build a framework to process these files efficiently. I understand that throwing many threads at disk I/O in Java doesn't scale well. What I'm thinking is to have a monitor thread that scans the directories, picks up the file names, and drops them into an ExecutorService or some Java blocking queue. A bunch of worker threads (kept small because of the I/O bottleneck) listening on the executor service would read the files, parse the HTML files within, and do the respective processing. This way, threads never fight over the same file.
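Here's a minimal sketch of what I have in mind for the monitor side, assuming a single watched directory; `WarcMonitor`, the pool size, and the `process` method are just placeholders:

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.concurrent.*;

public class WarcMonitor {
    // Small fixed pool so workers don't thrash the disk.
    private final ExecutorService workers = Executors.newFixedThreadPool(4);
    private final Path watchDir;

    public WarcMonitor(Path watchDir) {
        this.watchDir = watchDir;
    }

    // The JDK WatchService delivers each newly created file exactly once,
    // instead of re-scanning the directory and risking duplicate pickups.
    public void run() throws IOException, InterruptedException {
        try (WatchService watcher = FileSystems.getDefault().newWatchService()) {
            watchDir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);
            while (true) {
                WatchKey key = watcher.take(); // blocks until files arrive
                for (WatchEvent<?> event : key.pollEvents()) {
                    Path file = watchDir.resolve((Path) event.context());
                    workers.submit(() -> process(file)); // hand off to a worker
                }
                key.reset();
            }
        }
    }

    private void process(Path warcFile) {
        // read the WARC, iterate its HTML records, do the processing ...
    }
}
```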
Is this the right approach in terms of performance and scalability? Also, how should I handle the files once they are processed? Ideally, a processed file should be moved or tagged so that it isn't picked up by a worker thread again. Can this be handled through Future objects?
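For that second part, this is the kind of thing I'm considering: chaining the move onto the task's completion with a CompletableFuture, so the file only leaves the inbox after it has been fully processed. The `doneDir` path and `process` method are placeholders:

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.concurrent.*;

public class WarcDispatcher {
    private final ExecutorService workers = Executors.newFixedThreadPool(4);
    private final Path doneDir = Paths.get("/data/warc/processed"); // placeholder path

    public void submit(Path warcFile) {
        CompletableFuture
                .runAsync(() -> process(warcFile), workers)
                .thenRun(() -> {          // runs only if process() didn't throw
                    try {
                        // ATOMIC_MOVE so the scanner never sees a half-moved file
                        Files.move(warcFile, doneDir.resolve(warcFile.getFileName()),
                                   StandardCopyOption.ATOMIC_MOVE);
                    } catch (IOException e) {
                        // leave the file in place so it can be retried later
                    }
                });
    }

    private void process(Path warcFile) {
        // parse the WARC and handle each HTML record ...
    }
}
```

If processing fails, the file stays where it is and can be retried, which seems safer than tagging it up front; but I'd like to know whether plain Future objects are a better fit here.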