Read big text file to HashMap - heap overflow

Question

I'm trying to get the data from a text file into a HashMap. The text-file has the following format:

[Image: sample of the file format, with the key fields highlighted in green and the value field in red]

It has about 7 million lines (size: 700 MB).

So what I do is: I read each line, then I take the fields in green and concatenate them into a string, which will be the HashMap key. The value will be the field in red.

Every time I read a line, I have to check whether the HashMap already has an entry with that key. If so, I update the value by adding the red field to it; if not, a new entry is added to the HashMap.

I tried this with text files of 70,000 lines, and it works quite well.

But now, with the 7-million-line text file, I get a "java heap space" error, as in the image:

[Image: screenshot of the "java heap space" error]

Is this due to the HashMap? Is it possible to optimize my algorithm?

To store 700 MB of text you will need at least 1.4 GB, possibly closer to 3 GB with overhead and the HashMap. How much memory do you have?
– Peter Lawrey, Commented Oct 25, 2012 at 19:51
Note that you don't need to check if there is an entry first; if the entry is already in the HashMap, it will be replaced by the new value.
– Keppil, Commented Oct 25, 2012 at 19:54
I just changed my netbeans.conf to -J-Xms500M -J-XX:PermSize=1500M. I'll try it like this and check it out... By the way, I have 4 GB of RAM.
– kri8or, Commented Oct 25, 2012 at 20:02
So I did this: -J-Xms600M -J-XX:PermSize=1600M... but I still get the heap overflow, and if I add more to the Xms, NetBeans cannot start... :(
– kri8or, Commented Oct 25, 2012 at 20:37
What do you need this large amount of data for? Huge data is best managed by reading it in parts and storing only the small relevant parts in memory. You could create an API that reads only the required small parts of the huge data while keeping the rest on disk. You can have the API return HashMaps for these small parts, then use the HashMap. If you want to read and process the whole file, repeatedly ask the API for a HashMap per file section, discarding already processed sections from memory.
– ADTC, Commented Oct 25, 2012 at 20:44

Amit Deshpande · Accepted Answer · 2012-10-25 20:07:31Z

3

You should increase your heap space:

-Xms<size>        set initial Java heap size
-Xmx<size>        set maximum Java heap size

java -Xms1024m -Xmx2048m
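To confirm that the flags actually reached the JVM that runs the program, a quick check like the one below may help. This matters because, as far as I know, the -J options in netbeans.conf configure the JVM that runs the IDE itself, and -XX:PermSize sizes the permanent generation rather than the heap. This is only a minimal sketch; the class name HeapCheck and the helper toMiB are made up for illustration.

```java
// Prints the heap limits of the JVM that is actually running this code.
class HeapCheck {
    // Converts a byte count to whole mebibytes.
    static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // maxMemory() is the most the JVM will try to use (governed by -Xmx).
        System.out.println("max heap:   " + toMiB(rt.maxMemory()) + " MiB");
        // totalMemory() is the heap currently reserved by the JVM.
        System.out.println("total heap: " + toMiB(rt.totalMemory()) + " MiB");
        // freeMemory() is the unused part of the currently reserved heap.
        System.out.println("free heap:  " + toMiB(rt.freeMemory()) + " MiB");
    }
}
```

Running it with and without -Xmx2048m should show a different "max heap" value; if it does not change, the option is probably being passed to a different JVM.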

A nice read: From Java code to Java heap

Table 3. Attributes of a HashMap

Default capacity                    16 entries
Empty size                          128 bytes
Overhead                            64 bytes plus 36 bytes per entry
Overhead for a 10K collection       ~360K
Search/insert/delete performance    O(1): time taken is constant, regardless of the number of elements (assuming no hash collisions)

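Going by the table above, the HashMap overhead alone for 7 million entries comes to roughly 250 MB (7,000,000 × 36 bytes ≈ 252 MB), before counting the keys and values themselves, so your minimum heap size should be around 1000 MB.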

edited Oct 25, 2012 at 20:07

answered Oct 25, 2012 at 19:51

Amit Deshpande


6 Comments

kri8or Over a year ago

I just changed my netbeans.conf to -J-Xms500M -J-XX:PermSize=1500M I'll try like this and check it out...

kri8or Over a year ago

So, should I try another structure besides HashMap? For each line I need to check whether such an entry already exists, so with a HashMap I avoid using search algorithms...

Amit Deshpande Over a year ago

@javardo You don't need to check whether the entry exists; HashMap will replace the entry with the new value if it does.

kri8or Over a year ago

Yes AmitD, but if there is an entry, I also need to add the 'red' value of the new line to the one in the HashMap. If there isn't, I just add a new entry with the 'red' value.

16dots Over a year ago

@javardo Something like this? pastebin.com/b3GB6PS0 In the end you should have a map with the green strings as keys and the sums of the red integers as values.
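The sum-or-insert step discussed in these comments can be sketched as follows. This is not the code from the pastebin link, and the real file layout is only visible in the question's image, so the layout here is an assumption: whitespace-separated fields, with the first five columns forming the "green" key and the last column holding the "red" integer. The names SumByKey, KEY_FIELDS, aggregate and addLine are made up for illustration. Map.merge requires Java 8 or later.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

class SumByKey {
    // ASSUMPTION: the first KEY_FIELDS whitespace-separated columns are the key
    // ("green") and the last column is the integer to sum ("red").
    static final int KEY_FIELDS = 5;

    static Map<String, Long> aggregate(BufferedReader in) {
        Map<String, Long> totals = new HashMap<>();
        // lines() reads lazily, so the whole file is never held in memory at once.
        in.lines().forEach(line -> addLine(totals, line));
        return totals;
    }

    static void addLine(Map<String, Long> totals, String line) {
        String[] fields = line.trim().split("\\s+");
        if (fields.length <= KEY_FIELDS) {
            return; // skip blank or malformed lines
        }
        String key = String.join(" ", Arrays.copyOfRange(fields, 0, KEY_FIELDS));
        long value = Long.parseLong(fields[fields.length - 1]);
        // merge() stores the value if the key is new, otherwise adds it
        // to the total already in the map.
        totals.merge(key, value, Long::sum);
    }

    public static void main(String[] args) {
        // Made-up sample data in the assumed layout.
        String sample = "10.0.0.1 10.0.0.2 80 4000 tcp 100\n"
                      + "10.0.0.1 10.0.0.2 80 4000 tcp 50\n"
                      + "10.0.0.3 10.0.0.2 53 4001 udp 7\n";
        System.out.println(aggregate(new BufferedReader(new StringReader(sample))));
    }
}
```

For a real file, the reader could come from Files.newBufferedReader(Paths.get(path)) instead of a StringReader. Note that this only avoids the explicit containsKey check; the map still holds one entry per distinct key, so the heap must be large enough for all of them.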


DNA · Accepted Answer · 2012-10-25 21:54:18Z

1

As well as changing the heap size, consider 'compressing' (encoding) the keys by storing them as packed binary, not String.

Each IP address can be stored as 4 bytes. The port numbers (if that's what they are) are 2 bytes each. The protocol can probably be stored as a byte or less.

That's 13 bytes (4 + 4 + 2 + 2 + 1), rather than maybe 70 stored as a UTF-16 String, reducing the memory for keys by a factor of 5, if my maths is correct at this time of night...
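One way the packed key could look is sketched below. It assumes, as this answer guesses, that each line carries two IPv4 addresses, two port numbers and a protocol; the class name FlowKey, its field names and the numeric protocol code are hypothetical. The 13 bytes of payload are held in two long fields rather than a byte array, because a plain byte[] does not compare by content when used as a HashMap key.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical key type: two IPv4 addresses, two ports and a protocol code
// (4 + 4 + 2 + 2 + 1 = 13 bytes of payload) packed into two long fields.
final class FlowKey {
    private final long addresses; // source IP in the high 32 bits, destination IP in the low 32
    private final long rest;      // source port, destination port, protocol code

    FlowKey(String srcIp, String dstIp, int srcPort, int dstPort, int protocol) {
        this.addresses = (parseIpv4(srcIp) << 32) | parseIpv4(dstIp);
        this.rest = ((long) (srcPort & 0xFFFF) << 24)
                  | ((long) (dstPort & 0xFFFF) << 8)
                  | (protocol & 0xFF);
    }

    // Parses a dotted-quad IPv4 address into an unsigned 32-bit value held in a long.
    static long parseIpv4(String ip) {
        String[] parts = ip.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("not an IPv4 address: " + ip);
        }
        long value = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("octet out of range: " + ip);
            }
            value = (value << 8) | octet;
        }
        return value;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof FlowKey)) {
            return false;
        }
        FlowKey other = (FlowKey) o;
        return addresses == other.addresses && rest == other.rest;
    }

    @Override
    public int hashCode() {
        return 31 * Long.hashCode(addresses) + Long.hashCode(rest);
    }

    public static void main(String[] args) {
        Map<FlowKey, Long> totals = new HashMap<>();
        // Two lines with identical fields map to the same key, so their values are summed.
        totals.merge(new FlowKey("10.0.0.1", "10.0.0.2", 80, 4000, 6), 100L, Long::sum);
        totals.merge(new FlowKey("10.0.0.1", "10.0.0.2", 80, 4000, 6), 50L, Long::sum);
        System.out.println(totals.size() + " distinct key(s)");
    }
}
```

Each key object still carries a JVM object header, and each map entry has its own overhead, so the saving in practice will differ from the raw 13-versus-70-byte comparison; measuring with the real data is the only way to know.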

answered Oct 25, 2012 at 21:54

DNA

