How to speed up reading in from a massive file (Java)

Question

So basically, for this assignment I'm working on, we have to read in from a huge file of about a million lines, store the keys and values in a data structure of our choice (I'm using hash tables), offer functionality to change values for keys, and then save the key value stores back into a file. I'm using the cuckoo hashing method along with a method I found from a Harvard paper called "stashing" to accomplish this, and I'm fine with all of it. My only concern is the amount of time it is taking the program just to read in the data from the file.

The file is formatted so that each line has a key (integer) and a value (String) written like this:

12345 'abcdef'

23456 'bcdefg'

and so on. The method I have come up with to read this in is this:

private static void readData() throws IOException {
    try {
        BufferedReader inStream = new BufferedReader(new FileReader("input/data.db"));
        StreamTokenizer st = new StreamTokenizer(inStream);
        String line = inStream.readLine();
        do{
            String[] arr = line.split(" ");
            line = inStream.readLine();
            Long n = Long.parseLong(arr[0]);
            String s = arr[1];
            //HashNode<Long, String> node = HashNode.create(n, s); 
            //table = HashTable.empty();
            //table.add(n, s);

        }while(line != null);
    } catch (FileNotFoundException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }

}

The method works fine for actually getting the data, however I tested it with our test file of a million lines and it took about 20 minutes for it to get all the way through reading this all in. Surely, this isn't a fast time for reading in data from a file, and I am positive there must be a better way of doing it.

I have tried several different input methods: BufferedInputStream with FileInputStream, and Scanner, which didn't work (the file extension is .db). I initially didn't have the tokenizer but added it in hopes it would help. I don't know if the computer I'm running it on makes much of a difference. I'm currently doing the run on a MacBook Air, and I'm having a mate run it on his laptop in a bit to see if that helps it along. Any input on how to speed this up, or on what I might be doing to slow things down SO much, would be sincerely and greatly appreciated.

P.S. please don't hate me for programming on a Mac :-)

Nir Alfasi · Accepted Answer · 2017-10-14 00:00:05Z

2

You can use "java.nio.file.*". The following code is written in Java 8 style but can easily be adapted to earlier versions of Java if needed:

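        Map<Long, String> map = new HashMap<>();
        // Files.lines keeps the file open, so close the stream with try-with-resources
        try (Stream<String> lines = Files.lines(Paths.get("full-path-to-your-file"))) {
            lines.forEach(line -> {
                String[] arr = line.split(" ");
                Long number = Long.parseLong(arr[0]);
                String string = arr[1];
                map.put(number, string);
            });
        }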

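Note that Files.lines(..) returns a sequential stream, so forEach(...) here runs on a single thread unless you explicitly call parallel() on the stream (and a plain HashMap would not be safe to fill from a parallel forEach). forEach() does not formally guarantee that lines are processed in order, which you don't need in this case; if you did need them in order, you could call forEachOrdered() instead.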

On my MacBook it took less than 5 seconds to write 2 million such records to a file and then read it and populate the map.
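If the stream is run in parallel, for example by calling parallel() on it, the map being filled has to tolerate concurrent writes. The following is a minimal sketch of that variant, not a measured recommendation: the class name ParallelLoadSketch and the method load are made up for illustration, ConcurrentHashMap stands in for whatever thread-safe table you actually use, and every line is assumed to be a well-formed "key value" pair with no header line.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Stream;

// Hypothetical helper class, named for illustration only.
class ParallelLoadSketch {

    // Loads "key value" lines into a thread-safe map using a parallel stream.
    static Map<Long, String> load(Path file) throws IOException {
        Map<Long, String> map = new ConcurrentHashMap<>();
        // try-with-resources closes the underlying file once the stream is consumed
        try (Stream<String> lines = Files.lines(file, StandardCharsets.UTF_8)) {
            lines.parallel().forEach(line -> {
                // limit 2: split at the first space only, so a value may contain spaces
                String[] arr = line.split(" ", 2);
                map.put(Long.parseLong(arr[0]), arr[1]);
            });
        }
        return map;
    }

    public static void main(String[] args) throws IOException {
        // Small self-contained demo using a temporary file in the question's format
        Path tmp = Files.createTempFile("data", ".db");
        Files.write(tmp, Arrays.asList("12345 'abcdef'", "23456 'bcdefg'"), StandardCharsets.UTF_8);
        Map<Long, String> map = load(tmp);
        System.out.println(map.get(12345L)); // prints 'abcdef'
        Files.delete(tmp);
    }
}
```

Whether the parallel version is actually faster depends on how much work is done per line; for a plain split and parse, the sequential version may well be just as quick, so it is worth timing both.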

edited Oct 14, 2017 at 0:00

answered Oct 13, 2017 at 23:54


1 Comment

koko985 Over a year ago

This worked awesome for me. I tweaked the rest of my program, finished my hashing functions and now it takes about 2 seconds for the 1 million lines to be read and stored. Thank you!

user207421 · Accepted Answer · 2017-10-13 23:54:07Z

1

Get rid of the StreamTokenizer. You can read millions of lines per second with BufferedReader.readLine(), and that's all you're really doing: no tokenization.

But I strongly suspect the time isn't being spent in I/O but in processing each line.
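One way to check that suspicion is to time a pass that only reads lines against a pass that also splits and parses them. This is a rough sketch rather than a proper benchmark: the class and method names are invented here, it assumes every line is a well-formed "key value" pair (no header line), and the second pass will benefit from the operating system's file cache.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical timing harness, named for illustration only.
class ReadTimingSketch {

    // Pass 1: read every line and discard it (I/O cost only).
    static long countLines(Path file) throws IOException {
        long count = 0;
        try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            while (in.readLine() != null) {
                count++;
            }
        }
        return count;
    }

    // Pass 2: read each line and also split and parse it.
    // The checksum only exists so the parsed values are actually used.
    static long parseLines(Path file) throws IOException {
        long checksum = 0;
        try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] arr = line.split(" ", 2);
                checksum += Long.parseLong(arr[0]) + arr[1].length();
            }
        }
        return checksum;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ReadTimingSketch <file>");
            return;
        }
        Path file = Paths.get(args[0]);
        long t0 = System.nanoTime();
        long n = countLines(file);
        long t1 = System.nanoTime();
        long sum = parseLines(file);
        long t2 = System.nanoTime();
        System.out.printf("read only:    %d lines in %d ms%n", n, (t1 - t0) / 1_000_000);
        System.out.printf("read + parse: checksum %d in %d ms%n", sum, (t2 - t1) / 1_000_000);
    }
}
```

If both passes finish in a second or so on the million-line file, the remaining time is being spent in whatever is done with each key and value afterwards, such as the hash table insert.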

NB Your do/while loop is normally written as a while loop:

while ((line = in.readLine()) != null)

Much clearer that way, and no risk of NPEs.
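Put together, a reader built around that loop might look like the sketch below. The names LineLoaderSketch and load are hypothetical, java.util.HashMap stands in for your own hash table, and each data line is assumed to contain a key, a single space, and a value. The skipHeader flag covers a file whose first line is a header rather than data.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

// Hypothetical loader, named for illustration only.
class LineLoaderSketch {

    // Reads "key value" lines from the reader into a map.
    static Map<Long, String> load(BufferedReader in, boolean skipHeader) throws IOException {
        Map<Long, String> map = new HashMap<>();
        if (skipHeader) {
            in.readLine(); // discard the first line; returns null harmlessly on an empty file
        }
        String line;
        while ((line = in.readLine()) != null) {
            if (line.isEmpty()) {
                continue; // tolerate blank lines
            }
            // Assumes a well-formed line: split at the first space only
            int space = line.indexOf(' ');
            long key = Long.parseLong(line.substring(0, space));
            String value = line.substring(space + 1);
            map.put(key, value);
        }
        return map;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java LineLoaderSketch <file>");
            return;
        }
        // try-with-resources closes the reader even if parsing fails
        try (BufferedReader in = Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8)) {
            Map<Long, String> map = load(in, true);
            System.out.println("loaded " + map.size() + " entries");
        }
    }
}
```

Taking a BufferedReader rather than a file name keeps the loop easy to test against an in-memory StringReader.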

edited Oct 13, 2017 at 23:54

answered Oct 13, 2017 at 23:20


1 Comment

koko985 Over a year ago

I have waited for the code to finish running, and there are no errors at the end. Also, I have it ignoring the first line on purpose: the first line is a header containing the name of the file, which I don't want read in. Unfortunately, this is the entire Main class other than the main method, which simply calls this method; that's all so far. I will remove the tokenizer to see if that helps in any way.
