I have a scenario where a crawler periodically drops web archive (WARC) files into different directories. Each WARC file internally contains thousands of HTML files.
Now, I need to build a framework to process these files efficiently. I understand that throwing many threads at disk I/O in Java doesn't scale well. What I'm thinking is to have a monitor thread that scans the directories, picks up the file names, and drops them into an ExecutorService or some Java blocking queue. A bunch of worker threads (kept small because of the I/O bottleneck) listening on the executor service would read the files, parse the HTML files within, and do the respective processing. This way, threads never fight over the same file.
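Here's a minimal sketch of what I have in mind for the monitor side, assuming a single watched directory; `WarcMonitor`, the pool size, and the `process` method are just placeholders:

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.concurrent.*;

public class WarcMonitor {
    // Small fixed pool so workers don't thrash the disk.
    private final ExecutorService workers = Executors.newFixedThreadPool(4);
    private final Path watchDir;

    public WarcMonitor(Path watchDir) {
        this.watchDir = watchDir;
    }

    // The JDK WatchService delivers each newly created file exactly once,
    // instead of re-scanning the directory and risking duplicate pickups.
    public void run() throws IOException, InterruptedException {
        try (WatchService watcher = FileSystems.getDefault().newWatchService()) {
            watchDir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);
            while (true) {
                WatchKey key = watcher.take(); // blocks until files arrive
                for (WatchEvent<?> event : key.pollEvents()) {
                    Path file = watchDir.resolve((Path) event.context());
                    workers.submit(() -> process(file)); // hand off to a worker
                }
                key.reset();
            }
        }
    }

    private void process(Path warcFile) {
        // read the WARC, iterate its HTML records, do the processing ...
    }
}
```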
Is this the right approach in terms of performance and scalability? Also, how should I handle the files once they are processed? Ideally, a processed file should be moved or tagged so that it isn't picked up by a worker thread again. Can this be handled through Future objects?
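For that second part, this is the kind of thing I'm considering: chaining the move onto the task's completion with a CompletableFuture, so the file only leaves the inbox after it has been fully processed. The `doneDir` path and `process` method are placeholders:

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.concurrent.*;

public class WarcDispatcher {
    private final ExecutorService workers = Executors.newFixedThreadPool(4);
    private final Path doneDir = Paths.get("/data/warc/processed"); // placeholder path

    public void submit(Path warcFile) {
        CompletableFuture
                .runAsync(() -> process(warcFile), workers)
                .thenRun(() -> {          // runs only if process() didn't throw
                    try {
                        // ATOMIC_MOVE so the scanner never sees a half-moved file
                        Files.move(warcFile, doneDir.resolve(warcFile.getFileName()),
                                   StandardCopyOption.ATOMIC_MOVE);
                    } catch (IOException e) {
                        // leave the file in place so it can be retried later
                    }
                });
    }

    private void process(Path warcFile) {
        // parse the WARC and handle each HTML record ...
    }
}
```

If processing fails, the file stays where it is and can be retried, which seems safer than tagging it up front; but I'd like to know whether plain Future objects are a better fit here.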