For an assignment I'm working on, we have to read in a huge file of about a million lines, store the keys and values in a data structure of our choice (I'm using hash tables), offer functionality to change the values for keys, and then save the key-value store back to a file. I'm using cuckoo hashing along with a technique called "stashing" that I found in a Harvard paper, and I'm fine with all of that. My only concern is how long the program takes just to read the data in from the file.

The file is formatted so that each line has a key (integer) and a value (String) written like this:

12345 'abcdef'

23456 'bcdefg'

and so on. The method I have come up with to read this in is this:

private static void readData() throws IOException {
    try {
        BufferedReader inStream = new BufferedReader(new FileReader("input/data.db"));
        StreamTokenizer st = new StreamTokenizer(inStream);
        String line = inStream.readLine();
        do{
            String[] arr = line.split(" ");
            line = inStream.readLine();
            Long n = Long.parseLong(arr[0]);
            String s = arr[1];
            //HashNode<Long, String> node = HashNode.create(n, s); 
            //table = HashTable.empty();
            //table.add(n, s);

        }while(line != null);
    } catch (FileNotFoundException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }

}

The method works fine for actually getting the data; however, I tested it with our test file of a million lines, and it took about 20 minutes to read everything in. Surely that is not a reasonable time for reading a file of this size, and I am positive there must be a better way of doing it.

I have tried several different input methods: a BufferedInputStream wrapped around a FileInputStream, and Scanner (which didn't work for me; I assumed that was because the file extension is .db). I initially didn't have the tokenizer but added it in the hope that it would help. I don't know if the computer I'm running it on makes much of a difference. I'm currently doing the run on a MacBook Air, and a mate is going to run it on his laptop in a bit to see if that helps. Any input on how to speed this up, or on what I might be doing to slow things down SO much, would be sincerely and greatly appreciated.

P.S. please don't hate me for programming on a Mac :-)

2 Answers

You can use the java.nio.file package. The following code is written in Java 8 style but can easily be adapted to earlier versions of Java if needed:

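        // Needs java.nio.file.Files, java.nio.file.Paths, java.util.HashMap,
        // java.util.Map and java.util.stream.Stream.
        Map<Long, String> map = new HashMap<>();
        // Files.lines keeps the file open until the stream is closed,
        // so wrap it in try-with-resources.
        try (Stream<String> lines = Files.lines(Paths.get("full-path-to-your-file"))) {
            lines.forEach(line -> {
                String[] arr = line.split(" ");   // "12345 'abcdef'" -> ["12345", "'abcdef'"]
                Long number = Long.parseLong(arr[0]);
                String string = arr[1];
                map.put(number, string);
            });
        }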

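Note that Files.lines(..) returns a sequential stream, so forEach(...) here processes the lines one at a time on the calling thread; it is not executed in parallel. If you make the stream parallel by calling .parallel() on it, the lines will no longer be processed in order (which you don't need in this case); if you did need them in order, you could call forEachOrdered() instead. A plain HashMap is not thread-safe, though, so a parallel version would have to populate a concurrent map.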
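If parallel processing is actually wanted, the stream has to be made parallel explicitly and the results collected into a thread-safe map. The following is only a minimal sketch under some assumptions: the class name ParallelLoad and the method load are made up for illustration, the file is assumed to have no header line, and each line is assumed to be a key, a single space, and a value. Whether this is faster than the sequential version depends on the machine and the file, so it is worth measuring.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.concurrent.ConcurrentMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper class, not part of the original answer.
class ParallelLoad {

    // Reads "key value" lines in parallel into a concurrent map.
    static ConcurrentMap<Long, String> load(Path file) throws IOException {
        // try-with-resources closes the underlying file when the stream is done.
        try (Stream<String> lines = Files.lines(file)) {
            return lines.parallel()
                    .filter(line -> !line.isEmpty())      // skip blank lines
                    .map(line -> line.split(" ", 2))      // split on the first space only
                    .collect(Collectors.toConcurrentMap(
                            parts -> Long.parseLong(parts[0]),
                            parts -> parts[1],
                            // Duplicate keys: keep one value; which one wins is
                            // not deterministic when running in parallel.
                            (first, second) -> second));
        }
    }

    public static void main(String[] args) throws IOException {
        // The default path is the one from the question; pass another as args[0].
        Path file = Paths.get(args.length > 0 ? args[0] : "input/data.db");
        System.out.println("Loaded " + load(file).size() + " entries");
    }
}
```

Collecting with toConcurrentMap avoids mutating a shared HashMap from several threads, which is the main hazard of simply adding .parallel() to the forEach version.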

On my MacBook it took less than 5 seconds to write 2 million such records to a file and then read it and populate the map.


1 Comment

This worked awesome for me. I tweaked the rest of my program, finished my hashing functions and now it takes about 2 seconds for the 1 million lines to be read and stored. Thank you!

Get rid of the StreamTokenizer. You can read millions of lines per second with BufferedReader.readLine(), and that's all you're really doing: no tokenization.

But I strongly suspect the time isn't being spent in I/O but in processing each line.
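One way to check this is to time the raw read on its own, with no parsing and no hash table. The sketch below is an assumption-laden illustration rather than a proper benchmark: the class name ReadTiming and the method countLines are hypothetical, and the default path is simply the one from the question. If this finishes quickly on the million-line file, the 20 minutes are being spent elsewhere.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper class for measuring pure read time.
class ReadTiming {

    // Reads every line and discards it, returning only the line count.
    static long countLines(Path file) throws IOException {
        long count = 0;
        try (BufferedReader reader = Files.newBufferedReader(file)) {
            while (reader.readLine() != null) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        Path file = Paths.get(args.length > 0 ? args[0] : "input/data.db");
        long start = System.nanoTime();
        long lines = countLines(file);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("Read " + lines + " lines in " + elapsedMs + " ms");
    }
}
```

Adding the split, the parse, and the table insert back one at a time, and timing after each step, shows which one is responsible for the slowdown.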

NB Your do/while loop is normally written as a while loop:

while ((line = in.readLine()) != null)

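Much clearer that way, and no risk of NPEs: with the do/while version, an empty file makes the first readLine() return null, and line.split(" ") then throws.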
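Put together, a reader built around that loop might look like the sketch below. It is an illustration under stated assumptions, not the asker's actual code: the class name KeyValueReader, the method read, and the skipHeader flag are hypothetical, and java.util.HashMap stands in for the custom cuckoo hash table. It splits on the first space with indexOf rather than split.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

// Hypothetical reader; HashMap is a stand-in for a custom hash table.
class KeyValueReader {

    static Map<Long, String> read(Path file, boolean skipHeader) throws IOException {
        Map<Long, String> table = new HashMap<>();
        // try-with-resources closes the reader even if parsing fails.
        try (BufferedReader reader = Files.newBufferedReader(file)) {
            if (skipHeader) {
                reader.readLine();               // discard the header line, if any
            }
            String line;
            while ((line = reader.readLine()) != null) {
                int space = line.indexOf(' ');
                if (space < 0) {
                    continue;                    // skip blank or malformed lines
                }
                long key = Long.parseLong(line.substring(0, space));
                String value = line.substring(space + 1);
                table.put(key, value);
            }
        }
        return table;
    }

    public static void main(String[] args) throws IOException {
        Path file = Paths.get(args.length > 0 ? args[0] : "input/data.db");
        System.out.println("Loaded " + read(file, true).size() + " entries");
    }
}
```

The table is created once, before the loop, so every line adds to the same structure.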

1 Comment

I have waited for the code to run to completion, and there are no errors at the end. Also, I have it ignoring the first line on purpose: the first line is a header containing the name of the file, which I don't want read in. Unfortunately, this is the entire Main class other than the main method, which simply calls this method; that's all so far. I will remove the tokenizer to see if that helps in any way.
