Java multithreaded file system tree traversal

Question

I have a little task for my interview (sic!).
I need to create java CLI programm that searchs for specific files matched some pattern. They said I need to use multi-threading approach without using util.concurrent package and to provide good performance on parallel controllers.

From my point of view it's pretty simple - I can create specific thread for each subfolder and run over it's contents, and for each subfolder run another Thread...
But it can't be so much easy :) Maybe someone can suggest me typical pitfalls that I shoud be avare of. Or any advice of correct way to implement this in Java will be highly appreciated.

Thanks!

UPD1
File name should match pattern.

Just one hint: creating a thread per folder/subfolder will result in way too much threading overhead and a situation called over-saturation (if you have many subfolders you'll have way to many threads). — Thomas
– Thomas, Commented Sep 21, 2011 at 11:51
Where exactly does that pattern apply, to the path/filename or the files' content? — Thomas
– Thomas, Commented Sep 21, 2011 at 11:54
Parallel disk controllers? One thread per controller seems logical. Multiple threads making seek requests to the disk doesn't seem like a good idea, so one thread per subdir might not perform better. — Qwerky
– Qwerky, Commented Sep 21, 2011 at 11:56
Using multiple threads per disk can make a small difference. I have seen up to 50% improvement (I have also seen it make no difference at all except complicate the code). Using multiple controllers or even multiple disks makes more difference. — Peter Lawrey
– Peter Lawrey, Commented Sep 21, 2011 at 12:07
My answer would be: file searching is likely to be IO-constrained, not processor-constrained. If the file search is on one physical disk, multi-threading is likely to lead to worse performance, not better performance, as the hard IO performance limits are still there but now you've added multi-threading overhead — Steve McLeod
– Steve McLeod, Commented Sep 21, 2011 at 12:15

Thomas · Accepted Answer · 2011-09-21 12:10:59Z

3

The problem would be to find a good number of threads and distributing works as equally as possbible.

Assuming you don't know the number of files and subdirectories in each folder, that could become quite tricky.

Here's an idea for a start:

What you might do is create a number of threads that operate on a central folder list and spawn a thread per folder you encounter up to a certain maximum. Each thread could then put the subfolders of the directory it works on to the central list and when it is done it might pick the next from that list.

If a folder is put on the list and the maximum number of threads is not reached, a new thread is spawned immediately.

If the thread has run out of work and the folder list is empty, it could either stop (requiring you to spawn a new one if needed) or wait until either there's a folder on the list or the application signals that all folders are processed.

Finally, don't forget to synchronize on the folder list.

Hope that helps you to get started.

Edit: (don't take the following too seriously :) )

You could also use another thread pool implementation that doesn't use the java.util.concurrent package :)

Edit 2: Basically what I described above is a simple and task specific thread pool implementation. You might try and look for more information on building a thread pool yourself (in the context of your assignment a thread pool task would be scanning one folder).

edited Sep 21, 2011 at 12:10

answered Sep 21, 2011 at 12:04

Thomas

88.9k13 gold badges126 silver badges162 bronze badges

Sign up to request clarification or add additional context in comments.

6 Comments

Cratylus Over a year ago

Not sure what you mean operate on a central folder list.How is that created?

Thomas Over a year ago

@user384706 it's basically just a list/queue of either strings (paths) or File objects that represent the folders. You'd put each sub folder of the folder you are processing into that list and let the next available thread take it from there (might be the same thread but you don't know). - With "central" I mean that each thread can access it either directly or indirectly, hence the need for synchronization.

Cratylus Over a year ago

So you need to traverse the whole directory structure initially to create the list in the first place?

Thomas Over a year ago

@user384706 no, you fill that list on-the-fly and add worker threads as needed. Whenever a thread encounters subdirectories it puts them on the list. At the beginning you'll have only one thread but as more and more subfolders are added you'll have to spawn additional worker threads until the maximum (normally defined by the number of cores on the machine) is reached.

Cratylus Over a year ago

+1.I think it is good idea but only as a sample for interview.I still hold the opinion that for tasks like this OS does it best.In any case the OP doesn't say if the search pattern is for the filename or inside the contents of the file.

|

Alistair A. Israel · Accepted Answer · 2011-09-21 15:23:56Z

Off the top of my head I can think of a couple of approaches which I'll begin to outline here. If you need a more detailed explanation, I'd be glad to elaborate. Note that I have NOT attempted to code to test either of these (yet), so take them with a grain of salt.

The first approach relies on a central dispatcher (the main thread) to coordinate a pool of worker threads and a 'work' queue (of folders). The central dispatcher initializes the thread pool to its minimum size and assigns the root folder to the first worker thread, then signals it to wake up.

Each worker thread starts in an idle state, looping until it receives a 'wake up' signal. It then checks if it's been assigned a folder to process, and begins iterating through that folder's contents. The worker thread submits any files that match the pattern to the 'found' list (or just prints it to System.out).

Any subfolders are added back into the 'work' queue (and the main thread is signalled). Afterwards, the worker thread reenters the 'idle' state (and the main thread is signalled again). Upon waking up, it should either be assigned a new folder which it starts processing the same way. Otherwise, the thread can terminate itself.

The main thread waits until it receives a signal from any worker thread. It first checks if the work queue is empty. If not, then it checks if there any worker threads that are idle. While there are any idle worker threads (and the work queue is non-empty), pop the first folder off the work queue and assign it to that thread, and signal it to wake up. The main thread then goes back to waiting.

If there are no idle threads, then check the current thread pool size against the maximum configured size. If the pool is already at maximum, then the main thread goes back to sleep (basically waiting until a thread becomes idle). Otherwise, while the pool isn't at maximum, create a new thread, add it to the pool, and assign it the first folder in the work queue, then wake it up.

If the main thread wakes up with the work queue empty, then first it checks if there are any idle threads and the worker thread pool is larger than the configured minimum. If so, the main thread can elect to wake up idle threads (with no assigned folder, so they'll terminate) and remove them from the pool. If the thread pool is already at a minimum, then again the main thread just goes back to waiting. Note that this 'downsizing' step may be unnecessary optimization (since worker threads shouldn't be consuming CPU cycles waiting anyway).

If all threads are idle and the work queue is empty then the main thread knows it's done and can signal all remaining worker threads to wake up and terminate themselves.

The trick here is all in the signaling between the main thread and the worker threads, and synchronization on the work queue (and possibly the 'found files' bin).

(Also, to simplify things a bit, one can opt for a fixed-size thread pool.)

The alternative approach doesn't involve a central dispatcher, but uses a fixed pool of worker threads sleeping for random intervals, periodically waking up to check if either a) the work queue has an item, or b) all work has been done.

The main thread just initializes the thread pool, places the root folder at the head of the work 'queue' then starts up all worker threads. It then waits for a signal telling it that all work is done and it can clean up (interrupt all the remaining worker threads to wake them up and terminate themselves).

Each worker thread starts up in a non-idle state and checks the work queue. The first thread should see the root folder, which it pops off the queue and begins processing it similar to the above. All the other threads upon starting should see an empty queue and go to sleep.

When a worker thread wakes up, it checks if the work queue has any folders remaining to be processed, and if so, works on it, setting its 'idle' flag to false. When it's done with the current folder, it checks the work queue again. If the work queue is empty, it sets its 'idle' flag.

When a worker thread finds an empty queue (either upon waking up or after processing), it checks all the other threads if they're 'idle'. If it finds at least one other thread still working then it goes to sleep for a random interval.

If all other threads are 'idle' and the work queue is empty, the thread can terminate itself. Before doing so, it interrupts or signals the main thread so the main thread can do the clean up for the rest of the other threads.

This approach can also be adapted to use a flexible sized thread pool that can dynamically grow or shrink as needed, but that just puts more complex logic into the worker threads which I won't go into for now.

I think last option will not fit. This task actually mentions that this programm should be much faster than single-threaded programm with abstract disk controllers. I will try first option.

John B · Accepted Answer · 2011-09-21 11:52:10Z

0

You might want to consider using an ExecutorService to keep control of the number of threads created / running at any one time.

answered Sep 21, 2011 at 11:52

John B

33.1k8 gold badges80 silver badges100 bronze badges

2 Comments

Sean Patrick Floyd Over a year ago

pretty tricky to do that "without using util.concurrent package"

John B Over a year ago

Dar! Guess I wouldn't get the job. ;)

Cratylus · Accepted Answer · 2011-09-21 13:12:46Z

0

You could use JNI and call the search process of the OS for the pattern.

My assumption is that the OS would do it more efficiently than you could ever do it.

Use multithreading just for spawing a the JNI and parse results.

answered Sep 21, 2011 at 13:12

Cratylus

54.3k73 gold badges222 silver badges348 bronze badges

2 Comments

Thomas Over a year ago

JNI would require platform dependent solutions which kind of counters the purpose of using Java. I assume that there's no other meaning behind the task than to test the OP's multithreading skills, without actually requiring top performance.

Cratylus Over a year ago

Well from my point of view in real life OS would do it best. You are right it is not portable but in the end it depends on what you are trying to do.I mentioned JNI because in Windows platforms various COM bridge libraries in Java are not easy to use.So my point is delegate to OS which will do it best.

Collectives™ on Stack Overflow

Java multithreaded file system tree traversal

4 Answers 4

6 Comments

1 Comment

2 Comments

2 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

4 Answers 4

6 Comments

1 Comment

2 Comments

2 Comments

Your Answer

Sign up or log in

Post as a guest

Related