I'm doing deep learning neural net development, using the MNIST dataset for testing. The training set is composed of 60,000 sequences, each with 784 double precision input values. The process of reading this data from the file into an array in Java is somehow incurring an approximately 4GB memory overhead, which remains allocated throughout the run of the program. This overhead is in addition to the 60000*784*8 = 376MB allocated for the double precision array itself. It seems likely that this overhead occurs because Java is storing a complete copy of the file in addition to the numerical array, but perhaps this is Scanner overhead.
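For what it's worth, here's my back-of-envelope estimate of the garbage this parse generates, assuming ~10 characters per number token and typical 64-bit Java 8 object sizes (these per-object costs are rough assumptions, not measurements):

long tokens = 60_000L * 784; // one temporary String per number from split()
// Rough cost of one temporary String on a 64-bit Java 8 JVM (assumed, not measured):
// ~16B object header + ~8B fields + ~16B char[] header + 2B per char.
long bytesPerToken = 16 + 8 + 16 + 2 * 10; // ~60B for an assumed 10-char token
System.out.printf("~%.1f GB of transient Strings%n", tokens * bytesPerToken / 1e9); // prints ~2.8 GB

Adding the per-line Strings and the String[] arrays from split() would push this toward the ~4GB I'm seeing, if the heap simply grows to absorb the garbage and never shrinks back.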
According to a source, reading the file as a stream avoids storing the entire file in memory. However, I still have this problem with a stream read. I'm using Java 8 with IntelliJ IDEA 2016.2.4. This is the stream reading code:
FileInputStream inputStream = null;
Scanner fileScan = null;
String line;
String[] numbersAsStrings;
totalTrainingSequenceArray = new double[60000][784];
try {
    inputStream = new FileInputStream(m_sequenceFile);
    fileScan = new Scanner(inputStream, "UTF-8");
    int sequenceNum = 0;
    line = fileScan.nextLine(); // Read and discard the header line.
    while (fileScan.hasNextLine()) {
        line = fileScan.nextLine();
        numbersAsStrings = line.split("\\s+"); // Split the line into an array of strings on any whitespace.
        for (int inputPosition = 0; inputPosition < m_numInputs; inputPosition++) {
            totalTrainingSequenceArray[sequenceNum][inputPosition] = Double.parseDouble(numbersAsStrings[inputPosition]);
        }
        sequenceNum++;
    }
    if (fileScan.ioException() != null) { // Rethrow any IOException the Scanner swallowed.
        throw fileScan.ioException();
    }
} catch (IOException e) {
    e.printStackTrace();
} finally {
    if (inputStream != null) {
        try {
            inputStream.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
    if (fileScan != null) {
        fileScan.close();
    }
}
I've tried setting the stream and the scanner to null after the read and calling System.gc(), but it does nothing. Is this a Scanner overhead issue? What would be the simplest way to read this large data file without incurring a large permanent memory overhead? Thank you for any input.
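One alternative I'm considering, sketched below, is to drop Scanner and split() entirely and parse the numbers straight off a BufferedReader with StreamTokenizer, so that no per-number Strings are created at all. This is just a sketch, not tested against my actual file: it reuses the m_sequenceFile and m_numInputs fields from above, and StreamTokenizer's number parsing doesn't handle exponent notation, so it assumes plain decimal values. (Needs java.io.BufferedReader, java.io.FileReader, and java.io.StreamTokenizer.)

try (BufferedReader reader = new BufferedReader(new FileReader(m_sequenceFile))) {
    reader.readLine(); // Read and discard the header line, as before.
    StreamTokenizer tokenizer = new StreamTokenizer(reader);
    tokenizer.parseNumbers(); // The default: numeric tokens arrive as TT_NUMBER with the value in nval.
    for (int sequenceNum = 0; sequenceNum < 60000; sequenceNum++) {
        for (int inputPosition = 0; inputPosition < m_numInputs; inputPosition++) {
            if (tokenizer.nextToken() != StreamTokenizer.TT_NUMBER) {
                throw new IOException("Expected a number at sequence " + sequenceNum);
            }
            totalTrainingSequenceArray[sequenceNum][inputPosition] = tokenizer.nval;
        }
    }
} catch (IOException e) {
    e.printStackTrace();
}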
I've also seen the Files.lines() method mentioned as an alternative.
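For reference, this is roughly how I'd expect the Files.lines() version to look (assuming m_sequenceFile is a String path; with a File it would be m_sequenceFile.toPath(), and it needs java.nio.file.*, java.nio.charset.StandardCharsets, java.util.Arrays, and java.util.stream.Stream). Note that each line is still materialized as a String and split() still creates one String per number, so I wouldn't expect it to change the garbage profile:

try (Stream<String> lines = Files.lines(Paths.get(m_sequenceFile), StandardCharsets.UTF_8)) {
    totalTrainingSequenceArray = lines.skip(1) // Discard the header line.
            .map(l -> Arrays.stream(l.split("\\s+"))
                    .mapToDouble(Double::parseDouble)
                    .toArray())
            .toArray(double[][]::new);
} catch (IOException e) {
    e.printStackTrace();
}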