Java multithreading and files

Question

I'm working on a project. One part of it reads the files in a given folder. The program descends into subfolders and collects filenames and other info, which I wrap in my own DFile class and put into a collection for further work. It worked when it was single-threaded (using a recursive read), but I want to do it with multiple threads, ignoring the fact that multithreading won't make disk I/O any faster. I want it for learning purposes.

So far I've been jumping from one design to another, changing plans for how it should work, and I can't get it right. Your help would be appreciated.

What I want is to supply a root folder name and have my program run several worker threads (a user-defined number of them). Each thread reads the content of a given folder:
- When it finds a file, it wraps it in a DFile and puts it into a collection shared between the threads.
- When it finds a folder, it puts the folder (as a File object) into jobQueue for another available thread to work on.

I can't get this system right. I keep changing the code and my idea of what the classes should be, going from one class with static collections to many classes. So far, these are the classes I'm listing here:

DirectoryCrawler http://pastebin.com/8tVGpGT9

I won't publish the rest of my work (maybe in another topic, because the purpose of the program is not covered here at all). The program should read a folder and make a list of the files in it, then sort it (where I'll probably use multithreading too), then search for files with the same hash, while a constantly running thread writes those groups of equal files into a result file. I don't need to gain any performance; the files are going to be small. At first I was working on speed, but I don't need it now.

Any help regarding the design of the reading part would be appreciated.

EDIT:

So much headache :((. It doesn't work correctly :( Here's what I have so far: the crawler (like a mini-thread for reading one folder; found files go to fileList, which is in another class, and folders go to the queue): pastebin.com/AkJLAUhD

The scanner class (I don't even know whether it should be runnable or not): DirectoryScanner (the main one; it should control the crawlers and hold the main file list): pastebin.com/2abGMgG9

DFile itself: pastebin.com/8uqPWh6Z (something went wrong with hashing; now when sorting, all files get the same hash. It used to work. The hashing is for another, unrelated task.)

FileList: pastebin.com/Q2yM6ZwS

testcode:

DirectoryScanner reader = new DirectoryScanner(4);
for (int i = 0; i < 4; i ++) {
    reader.runTask(new DirectoryCrawler("myroot", reader));
}
try {
    reader.kill();
    while (!reader.isDone()) {
        System.out.println("notdone");
    }
    reader.getFileList().print();
}

myroot is a folder with some files for testing.

I can't even decide whether the scanner itself should be runnable, or only the crawlers, because while scanning I actually don't want to start doing other stuff like sorting (there is nothing to sort until all the files have been gathered).

There is no right answer for this, and Programmers is probably a better place for it, but your main problem is that you appear to be trying to do too much in a thread. If it was me, I'd have a queue of tasks with a path argument / property, not a list of paths / files.
– Tony Hopkinson, Commented Apr 14, 2012 at 11:32
@TonyHopkinson - yes, that is how to do it - queueing folder tasks.
– Martin James, Commented Apr 14, 2012 at 12:44

Martin James · Accepted Answer · 2012-04-15 18:57:28Z

You need the Executor threadpool and some classes:

An Fsearch class. This contains your container for the results. It also has a factory method that returns an Ffolder, counting up a 'foldersOutstanding' counter, and an OnComplete that counts them back in by counting down 'foldersOutstanding'.

You need an Ffolder class to represent a folder; it is passed its path as a ctor parameter, along with the Fsearch instance. It should have a run method that iterates the folder at that path.

Create and load up an Fsearch with the root folder and fire it into the pool. It creates an Ffolder, passing it the root path and itself, and loads that onto the pool. Then it waits on a 'searchComplete' event.

That first Ffolder iterates its folder, creating, (or depooling), DFiles for each 'ordinary' file and pushing them into the Fsearch container. If it finds a folder, it gets another Ffolder from the Fsearch, loads it with the new path and loads that onto the pool as well.

When an Ffolder has finished iterating its own folder, it calls the 'OnComplete' method of the Fsearch. The OnComplete counts down 'foldersOutstanding' and, when it is decremented to zero, all the folders have been scanned and all the files processed. The thread that did this final decrement signals the searchComplete event so that the Fsearch can continue. The Fsearch could then call some 'OnSearchComplete' event that it was passed when it was created.

It goes almost without saying that the Fsearch callbacks must be thread-safe.
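For the Java case in the question, the scheme above could be sketched roughly as follows. This is only an illustration under some assumptions: FolderSearch, submitFolder and scan are made-up names, plain File objects stand in for DFile, and the thread that calls search() does the waiting instead of a pool thread.

```java
import java.io.File;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical Java counterpart of the Fsearch role described above. */
class FolderSearch {
    private final ExecutorService pool;
    // Shared results container; ConcurrentLinkedQueue is safe for concurrent adds.
    private final Queue<File> results = new ConcurrentLinkedQueue<>();
    private final AtomicInteger foldersOutstanding = new AtomicInteger();
    private final CountDownLatch searchComplete = new CountDownLatch(1);

    FolderSearch(ExecutorService pool) {
        this.pool = pool;
    }

    /** Counts the folder in BEFORE queueing it, so the counter cannot reach zero early. */
    private void submitFolder(File dir) {
        foldersOutstanding.incrementAndGet();
        pool.execute(() -> scan(dir));
    }

    /** The Ffolder role: list one directory, queue sub-folders, collect ordinary files. */
    private void scan(File dir) {
        try {
            File[] entries = dir.listFiles(); // null if unreadable or not a directory
            if (entries != null) {
                for (File entry : entries) {
                    if (entry.isDirectory()) {
                        submitFolder(entry);
                    } else {
                        results.add(entry);
                    }
                }
            }
        } finally {
            // The OnComplete role: whichever thread performs the last decrement signals completion.
            if (foldersOutstanding.decrementAndGet() == 0) {
                searchComplete.countDown();
            }
        }
    }

    /** Starts at root and blocks the calling thread (not a pool thread) until every folder is done. */
    Queue<File> search(File root) throws InterruptedException {
        submitFolder(root);
        searchComplete.await();
        return results;
    }
}
```

With a pool from Executors.newFixedThreadPool(n), new FolderSearch(pool).search(new File("myroot")) would return once the counter has dropped back to zero. The sketch does not guard against symbolic-link cycles or a pool that is shut down mid-search; both would need handling in real code.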

Such an exercise does not have to be academic. The container in the Fsearch, where all the DFiles go, could be a producer-consumer queue. Other threads could start processing the DFiles as the search is in progress, instead of waiting until the end.
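A minimal Java sketch of the consumer side of such a producer-consumer queue might look like this. FileConsumer and END_OF_SEARCH are hypothetical names, File stands in for DFile, and the "processing" is just recording the file name. The sentinel object is an assumption about how the search tells the consumer it has finished.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;

/** Hypothetical consumer that processes files while the search is still producing them. */
class FileConsumer implements Runnable {
    /** Sentinel ("poison pill") that the producer side enqueues once the search is complete. */
    static final File END_OF_SEARCH = new File("");

    private final BlockingQueue<File> queue;
    // Only the consumer thread writes this list; read it after joining that thread.
    private final List<String> processedNames = new ArrayList<>();

    FileConsumer(BlockingQueue<File> queue) {
        this.queue = queue;
    }

    @Override
    public void run() {
        try {
            while (true) {
                File next = queue.take();       // blocks until a producer adds something
                if (next == END_OF_SEARCH) {    // identity check: only the sentinel stops the loop
                    return;
                }
                processedNames.add(next.getName()); // stand-in for hashing, sorting, etc.
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the interrupt flag set and stop
        }
    }

    List<String> processedNames() {
        return processedNames;
    }
}
```

The folder tasks would put() each file onto a LinkedBlockingQueue as they find it, and the thread that does the final 'foldersOutstanding' decrement would put END_OF_SEARCH, once per consumer thread if there are several.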

I have done this before (but not in Java) - it works OK. A design like this can easily do multiple searches in parallel - it's fun to issue an Fsearch for several hard drive roots at once - the clattering noise is impressive.

Forgot to say - the big gain from such a design is when searching several networked drives with high latency. They can all be searched in parallel. The speedup over a miserable single-threaded sequential search is many times. By the time a single-threaded search has finished queueing up the DFiles for one drive, the multi-search has searched four drives and already had most of its DFiles processed.

NOTE:

1) If implemented strictly as above, the threadpool thread that executes the Fsearch is blocked on the 'searchComplete' event until the search is over, so 'using up' one thread. There must therefore be more threadpool threads than live Fsearch instances, else there will be no threads left over to do the actual searching (yes, of course that happened to me:).
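One way around this in Java, sketched here as an assumption rather than as part of the design above, is to have the last folder task complete a CompletableFuture so that no pool thread ever waits. AsyncFileCount and its methods are made-up names, and it only counts files to keep the example short.

```java
import java.io.File;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical variant of the search that never parks a pool thread. */
class AsyncFileCount {
    private final Executor pool;
    private final AtomicInteger foldersOutstanding = new AtomicInteger();
    private final AtomicInteger fileCount = new AtomicInteger();
    private final CompletableFuture<Integer> done = new CompletableFuture<>();

    private AsyncFileCount(Executor pool) {
        this.pool = pool;
    }

    /** Starts counting the files under root and returns immediately. */
    static CompletableFuture<Integer> start(File root, Executor pool) {
        AsyncFileCount search = new AsyncFileCount(pool);
        search.submitFolder(root);
        return search.done;
    }

    private void submitFolder(File dir) {
        foldersOutstanding.incrementAndGet(); // count in before queueing
        pool.execute(() -> scan(dir));
    }

    private void scan(File dir) {
        try {
            File[] entries = dir.listFiles(); // null if unreadable or not a directory
            if (entries != null) {
                for (File entry : entries) {
                    if (entry.isDirectory()) {
                        submitFolder(entry);
                    } else {
                        fileCount.incrementAndGet();
                    }
                }
            }
        } finally {
            // The last task to finish publishes the total; nobody blocks waiting for it.
            if (foldersOutstanding.decrementAndGet() == 0) {
                done.complete(fileCount.get());
            }
        }
    }
}
```

Because nothing waits inside the pool, several searches can share even a single-threaded executor: call AsyncFileCount.start(...) once per root and combine the returned futures, for example with CompletableFuture.allOf.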

2) Unlike a single-threaded search, results don't come back in any sort of predictable or repeatable order. If, for example, you signal your results to a GUI thread as they come in and try to display them in a TreeView, the path through the treeview component will likely be different for each result, so updating the visual treeview will be slow. This can result in the Windows GUI input queue getting full (10000 limit) because the GUI cannot keep up. Or, if you are using object pools for the Ffolder etc., the pool can empty, slugging performance; and if the GUI thread then tries to get an Ffolder from the empty pool to issue a new search, and so blocks, you get all-round deadlock with all the Ffolder instances stuck in Windows messages (yes, of course that happened to me:). It's best not to let such things happen!

Example - I found something like this. It's quite old Windows/C++ Builder code but it still works. I tried it on my Rad Studio 2009, removed all the legacy/proprietary gunge and added some extra comments. All it does here is count up the folders and files, just as an example. There are only a couple of 'runnable' classes. The myPool->submit() method loads a runnable onto the pool and its run() method gets executed. The base ctor has an 'OnComplete' EventHandler (TNotifyEvent) delegate parameter that gets fired by the pool thread when the run() method returns.

//******************************* CLASSES ********************************

class DirSearch; // forward dec.

class ScanDir:public PoolTask{
    String FmyDirPath;
    DirSearch *FmySearch;
    TStringList *filesAndFolderNames;
public:                              // Counts for FmyDirPath only
    int fileCount,folderCount;
    ScanDir(String thisDirPath,DirSearch *mySearch);
    void run();                        // an override - called by pool thread
};


class DirSearch:public PoolTask{
    TNotifyEvent FonComplete;
    int dirCount;
    TEvent *searchCompleteEvent;
    CRITICAL_SECTION countLock;
public:
    String FdirPath;
    int totalFileCount,totalFolderCount;  // Count totals for all ScanDir's

    DirSearch(String dirPath, TNotifyEvent onComplete);
    ScanDir* getScanDir(String path);      // get a ScanDir and inc's count
    void run();                           // an override - called by pool thread
    void __fastcall scanCompleted(TObject *Sender); // called by ScanDir's
};

//******************************* METHODS ********************************

// ctor - just calls base ctor and initializes stuff..
ScanDir::ScanDir(String thisDirPath,DirSearch *mySearch):FmySearch(mySearch),
        FmyDirPath(thisDirPath),fileCount(0),folderCount(0),
        PoolTask(0,mySearch->scanCompleted){};


void ScanDir::run()  // an override - called by pool thread
{
//  fileCount=0;
//  folderCount=0;
    filesAndFolderNames=listAllFoldersAndFiles(FmyDirPath); // gets files
    for (int index = 0; index < filesAndFolderNames->Count; index++)
    { // for all files in the folder..
        if((int)filesAndFolderNames->Objects[index]&faDirectory){
            folderCount++;  //do count and, if it's a folder, start another ScanDir
            String newFolderPath=FmyDirPath+"\\"+filesAndFolderNames->Strings[index];
            ScanDir* newScanDir=FmySearch->getScanDir(newFolderPath);
            myPool->submit(newScanDir);
        }
        else fileCount++; // inc 'ordinary' file count
    }
    delete(filesAndFolderNames); // don't leak the TStringList of filenames
};

DirSearch::DirSearch(String dirPath, TNotifyEvent onComplete):FdirPath(dirPath),
    FonComplete(onComplete),totalFileCount(0),totalFolderCount(0),dirCount(0),
    PoolTask(0,onComplete)
{
    InitializeCriticalSection(&countLock);  // thread-safe count
    searchCompleteEvent=new TEvent(NULL,false,false,"",false); // an event
                                        // for DirSearch to wait on till all ScanDir's done
};

ScanDir* DirSearch::getScanDir(String path)
{  // up the dirCount while providing a new ScanDir
    EnterCriticalSection(&countLock);
    dirCount++;
    LeaveCriticalSection(&countLock);
    return new ScanDir(path,this);
};

void DirSearch::run()  // called on pool thread
{
    ScanDir *firstScanDir=getScanDir(FdirPath); // get first ScanDir for top
    myPool->submit(firstScanDir);               // folder and set it going
    searchCompleteEvent->WaitFor(INFINITE);     // wait for them all to finish
}

/* NOTE - this is a DirSearch method, but it's called by the pool threads
running the ScanDirs when they complete.  The 'DirSearch' pool thread is stuck
on the searchCompleteEvent, waiting for all the ScanDirs to complete, at which
point the dirCount will be zero and the searchCompleteEvent signalled.
*/
void __fastcall DirSearch::scanCompleted(TObject *Sender){ // a ScanDir done
    ScanDir* thiscan=(ScanDir*)Sender;  // get the instance that completed back
    EnterCriticalSection(&countLock);   // thread-safe
    totalFileCount+=thiscan->fileCount;     // add ScanDir counts to totals
    totalFolderCount+=thiscan->folderCount;
    bool allDone=(--dirCount==0);         // another one gone.. test it under the lock
    LeaveCriticalSection(&countLock);
    if(allDone) searchCompleteEvent->SetEvent(); // if all done, signal
    delete(thiscan);                      // another one bites the dust..
};

..and here it is, working:

[Image: screenshot of the code working]

Good luck - I edited a couple points that I remember from my experiences, (unpleasant, sharp points:)
It's a bit hard to understand at some points. I've added some info to my first post. Please check my source; it would be very much appreciated. I'm absolutely tired; I spent the whole day on it and can't get it to work. Half of the day went on sorting in parallel and I got nothing, so I removed all the sorting...
Nice answer, I'd hope to be somewhere near it once I stopped making mistakes. +1 for effort

Radu Murzea · Accepted Answer · 2012-04-14 11:34:03Z

If you want to learn some multi-threading by doing some practical implementation, it would be best to pick something where the switch from a single-threaded activity to a multi-threaded one would actually make sense.

In this case, it doesn't make any sense. And, to achieve it, it would require some ugly pieces of code to be written. That is because you could have, for example, one thread handle just one subfolder (first level after the root folder). But what if you start with 200 subfolders ? Or more... Will 200 threads in that case make sense ? I doubt it...

3 Comments

Igor222 Over a year ago

I think I mentioned there's a fixed, user-defined number of threads, so each thread, when it becomes available, takes another folder from the queue and processes it. I really do know it won't help, but even the man who gave me the task said he needs to see how I implement multithreading and doesn't care about performance etc.

Radu Murzea Over a year ago

If "the man" needs to see how you implement multi-threading, why don't you implement it in another part of the program (where it would make more sense) ? That's what I was trying to say...

Martin James Over a year ago

It can make sense! Processing of the files found does not have to wait until the search is complete - if the container for the DFile's is a producer-consumer queue, processing can start as soon as the first one is ready. Searches of several drives can be run in parallel - this is very important in the case of multiple networked drives.
