
I have a network-attached storage holding around 5 million .txt files related to around 3 million transactions. The total data size is around 3.5 TB. I have to search that location to find whether the file for each transaction is available or not, and produce two separate CSV reports: "available files" and "not available files". We are still on Java 6. The challenge I am facing is that, since I have to search the location recursively, each search takes around 2 minutes on average because of the huge size. I am using the java.io API to search recursively, like below. Is there any way I can improve the performance?

File searchFile(File location, String fileName) {
    if (location.isDirectory()) {
        File[] arr = location.listFiles();
        if (arr != null) {  // listFiles() returns null on I/O error
            for (File f : arr) {
                File found = searchFile(f, fileName);
                if (found != null)
                    return found;
            }
        }
    } else {
        if (location.getName().equals(fileName)) {
            return location;
        }
    }
    return null;
}
  • When you are looping over such a big number of files, recursion is very bad: it adds a lot of overhead for the JVM. But it also depends on how deep your directory structure is.
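
As an illustration of that comment, here is a minimal sketch of the same search with an explicit stack instead of recursion (java.util.ArrayDeque has been available since Java 6). Note that on a NAS the dominant cost is usually the per-directory network round trip, so this alone will not fix the 2-minute scans:

import java.io.File;
import java.util.ArrayDeque;
import java.util.Deque;

public class IterativeSearch {
    // Same search as above, but with an explicit stack replacing the call stack.
    static File searchFile(File location, String fileName) {
        Deque<File> stack = new ArrayDeque<File>();
        stack.push(location);
        while (!stack.isEmpty()) {
            File current = stack.pop();
            if (current.isDirectory()) {
                File[] children = current.listFiles();
                if (children != null) {  // listFiles() returns null on I/O error
                    for (File child : children) {
                        stack.push(child);
                    }
                }
            } else if (current.getName().equals(fileName)) {
                return current;
            }
        }
        return null;
    }
}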

4 Answers


You should take a different approach: rather than walking the entire directory tree every time you search for a file, create an index, that is, a mapping from filename to file location.

Essentially:

void buildIndex(Map<String, File> index, File location) {
    if (location.isDirectory()) {
        File[] arr = location.listFiles();
        if (arr != null) {  // listFiles() returns null on I/O error
            for (File f : arr) {
                buildIndex(index, f);
            }
        }
    } else {
        index.put(location.getName(), location);
    }
}

Now that you've got the index, searching for the files becomes trivial.
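
For example, a minimal sketch of the lookup step (the txId + ".txt" naming and the mount point are assumptions; the question doesn't say how transactions map to file names):

Map<String, File> index = new HashMap<String, File>();
buildIndex(index, new File("/mnt/nas/transactions")); // hypothetical NAS mount point

// One O(1) map lookup per transaction instead of a ~2-minute directory walk
String fileName = txId + ".txt";  // assumed naming convention
File match = index.get(fileName);
boolean available = (match != null);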

Now that you've got the files in a Map, you can even use set operations to find the intersection:

Map<String, File> index = new HashMap<String, File>();
buildIndex(index, ...);
Set<String> fileSet = index.keySet();
Set<String> transactionSet = ...;
Set<String> intersection = new HashSet<String>(fileSet);
intersection.retainAll(transactionSet);

Optionally, if the index itself is too big to keep in memory, you may want to create the index in an SQLite database.
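
A minimal sketch of that SQLite variant, assuming the Xerial sqlite-jdbc driver is on the classpath (the table, database, and file names are made up for illustration):

import java.io.File;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class SqliteIndex {
    public static void main(String[] args) throws Exception {
        Class.forName("org.sqlite.JDBC"); // pre-JDBC-4 driver registration, needed on Java 6
        Connection conn = DriverManager.getConnection("jdbc:sqlite:file-index.db");
        conn.createStatement().execute(
                "CREATE TABLE IF NOT EXISTS files (name TEXT PRIMARY KEY, path TEXT)");

        // While walking the tree, insert each file instead of putting it in a Map.
        PreparedStatement ins = conn.prepareStatement(
                "INSERT OR REPLACE INTO files (name, path) VALUES (?, ?)");
        File f = new File("/mnt/nas/transactions/TX123456.txt"); // hypothetical file
        ins.setString(1, f.getName());
        ins.setString(2, f.getAbsolutePath());
        ins.executeUpdate();

        // Lookup: the PRIMARY KEY gives an indexed search, no directory walk.
        PreparedStatement query = conn.prepareStatement(
                "SELECT path FROM files WHERE name = ?");
        query.setString(1, "TX123456.txt");
        ResultSet rs = query.executeQuery();
        boolean available = rs.next();
        System.out.println("available: " + available);
        conn.close();
    }
}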


3 Comments

you are right, and in this case the index itself will grow too big... so it's better to divide the index on various criteria, the simplest being by the first character of the file name. Long ago I asked a question about removing duplicates, but the core logic suggested by various people might be applicable... stackoverflow.com/questions/12501112/…
@Ketan: I'd suggest you don't bother with tricks like splitting the index by first character. Just let SQLite deal with managing the data. That said, 5 million filenames is a tiny amount of data: at most around a gigabyte (likely much less), which is still fairly small on modern machines. You'll likely never need anything other than a Map unless your data is a couple of orders of magnitude larger than this.
SQLite is not an option on my plate. I was able to reduce the time by creating an index, but since the large volume of data is stored on a shared drive, it still takes time to load all the files to build the index Map. Thanks a lot for the hints on indexing!
  • Searching a directory on network-attached storage is a nightmare. It takes a lot of time when the directory tree is too big or too deep. Since you are on Java 6, you can follow an old-fashioned approach: list all files into a CSV file like below.
  • e.g.

    find . -type f -name '*.txt' > test.csv   (if Unix)

    dir /b/s *.txt > test.csv   (if Windows)

  • Now load this CSV file into a Map keyed by filename to use as an index (a sketch follows below). Loading the file will take some time as it will be huge, but once it is loaded, looking a file name up in the map is quick and will reduce your search time drastically.
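
A minimal sketch of that load-and-report step. The loadTransactionIds() helper and the txId + ".txt" naming scheme are assumptions, since the question doesn't say how transactions map to file names:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.PrintWriter;
import java.util.HashMap;
import java.util.Map;

public class ReportBuilder {
    public static void main(String[] args) throws Exception {
        // Load the file listing produced by find/dir into a name -> path index.
        Map<String, String> index = new HashMap<String, String>();
        BufferedReader in = new BufferedReader(new FileReader("test.csv"));
        String path;
        while ((path = in.readLine()) != null) {
            // find(1) separates with '/', dir /b/s with '\' -- handle both
            int cut = Math.max(path.lastIndexOf('/'), path.lastIndexOf('\\'));
            index.put(path.substring(cut + 1), path);
        }
        in.close();

        // Split the transactions into the two requested CSV reports.
        PrintWriter available = new PrintWriter("available.csv");
        PrintWriter missing = new PrintWriter("not_available.csv");
        for (String txId : loadTransactionIds()) {  // hypothetical helper
            String fileName = txId + ".txt";        // assumed naming scheme
            if (index.containsKey(fileName)) {
                available.println(txId + "," + index.get(fileName));
            } else {
                missing.println(txId);
            }
        }
        available.close();
        missing.close();
    }

    private static java.util.List<String> loadTransactionIds() {
        // Placeholder: read the 3 million transaction IDs from wherever they live.
        return new java.util.ArrayList<String>();
    }
}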

1 Comment

Thanks for the excellent idea. Running the search locally after getting the list of files as a CSV using the above command was really helpful. After that, I generated a HashMap from the CSV and did all my report processing. Now I am able to generate the reports in far less time. You saved my day. Cheers!!

You can use NIO's FileVisitor, available in Java 6.

import java.io.IOException;
import java.nio.file.*;
import java.nio.file.attribute.BasicFileAttributes;

Path findTransactionFile(Path root) throws IOException {
    // Holder array: an anonymous inner class cannot assign to a plain local variable.
    final Path[] result = new Path[1];
    Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
        @Override
        public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs) throws IOException {
            if (/* todo dir predicate */ false) {
                return FileVisitResult.SKIP_SUBTREE; // optimization: prune subtrees that cannot match
            }
            return FileVisitResult.CONTINUE;
        }

        @Override
        public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
            if (/* todo file predicate */ true) {
                result[0] = file;
                return FileVisitResult.TERMINATE; // found
            }
            return FileVisitResult.CONTINUE;
        }
    });

    return result[0];
}
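
For completeness, a call site might look like this (the mount path is made up; Files.walkFileTree can throw IOException, hence the throws clause above):

Path hit = findTransactionFile(Paths.get("/mnt/nas/transactions"));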

1 Comment

I think I was wrong: it's available since Java 7. You can use Apache VFS for similar behaviour on older Java versions.

I don't know the answer, but from an algorithmic perspective your program has the worst possible complexity: each lookup for a single transaction iterates over all 5 million files, and you have 3 million transactions, so a full run is on the order of 5 million × 3 million = 15 trillion file visits.

My suggestion is to iterate over the 5 million files once and build an index keyed by file name, then iterate over the transactions and look each one up in the index instead of doing a full scan. Alternatively, there may be free third-party tools that can index a large file system and expose that index to an external application (in this case, your Java app). If you cannot find such a tool, it is better to build it yourself; then you can build the index in a way that is optimal for your requirements.
