I'm doing deep learning neural net development, using the MNIST dataset for testing. The training set is composed of 60,000 sequences, each with 784 double precision input values. The process of reading this data from the file into an array in Java is somehow incurring an approximately 4GB memory overhead, which remains allocated throughout the run of the program. This overhead is in addition to the 60000*784*8 = 376MB allocated for the double precision array itself. It seems likely that this overhead occurs because Java is storing a complete copy of the file in addition to the numerical array, but perhaps this is Scanner overhead.
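For what it's worth, here's my back-of-envelope estimate of the garbage this parse generates, assuming ~10 characters per number token and typical 64-bit Java 8 object sizes (these per-object costs are rough assumptions, not measurements):

long tokens = 60_000L * 784; // one temporary String per number from split()
// Rough cost of one temporary String on a 64-bit Java 8 JVM (assumed, not measured):
// ~16B object header + ~8B fields + ~16B char[] header + 2B per char.
long bytesPerToken = 16 + 8 + 16 + 2 * 10; // ~60B for an assumed 10-char token
System.out.printf("~%.1f GB of transient Strings%n", tokens * bytesPerToken / 1e9); // prints ~2.8 GB

Adding the per-line Strings and the String[] arrays from split() would push this toward the ~4GB I'm seeing, if the heap simply grows to absorb the garbage and never shrinks back.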
According to a source, reading the file as a stream avoids storing the entire file in memory. However, I still have this problem with a stream read. I'm using Java 8 with IntelliJ IDEA 2016.2.4. This is the stream reading code:
FileInputStream inputStream = null;
Scanner fileScan = null;
String line;
String[] numbersAsStrings;
totalTrainingSequenceArray = new double[60000][784];
try {
    inputStream = new FileInputStream(m_sequenceFile);
    fileScan = new Scanner(inputStream, "UTF-8");
    int sequenceNum = 0;
    line = fileScan.nextLine(); // Read and discard the header line.
    while (fileScan.hasNextLine()) {
        line = fileScan.nextLine();
        numbersAsStrings = line.split("\\s+"); // Split the line into an array of strings on any whitespace.
        for (int inputPosition = 0; inputPosition < m_numInputs; inputPosition++) {
            totalTrainingSequenceArray[sequenceNum][inputPosition] = Double.parseDouble(numbersAsStrings[inputPosition]);
        }
        sequenceNum++;
    }
    if (fileScan.ioException() != null) { // Rethrow any IOException the Scanner swallowed.
        throw fileScan.ioException();
    }
} catch (IOException e) {
    e.printStackTrace();
} finally {
    if (inputStream != null) {
        try {
            inputStream.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
    if (fileScan != null) {
        fileScan.close();
    }
}
I've tried setting the stream and the scanner to null after the read and calling System.gc(), but it does nothing. Is this a Scanner overhead issue? What would be the simplest way to read this large data file without incurring a large permanent memory overhead? Thank you for any input.
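One alternative I'm considering, sketched below, is to drop Scanner and split() entirely and parse the numbers straight off a BufferedReader with StreamTokenizer, so that no per-number Strings are created at all. This is just a sketch, not tested against my actual file: it reuses the m_sequenceFile and m_numInputs fields from above, and StreamTokenizer's number parsing doesn't handle exponent notation, so it assumes plain decimal values. (Needs java.io.BufferedReader, java.io.FileReader, and java.io.StreamTokenizer.)

try (BufferedReader reader = new BufferedReader(new FileReader(m_sequenceFile))) {
    reader.readLine(); // Read and discard the header line, as before.
    StreamTokenizer tokenizer = new StreamTokenizer(reader);
    tokenizer.parseNumbers(); // The default: numeric tokens arrive as TT_NUMBER with the value in nval.
    for (int sequenceNum = 0; sequenceNum < 60000; sequenceNum++) {
        for (int inputPosition = 0; inputPosition < m_numInputs; inputPosition++) {
            if (tokenizer.nextToken() != StreamTokenizer.TT_NUMBER) {
                throw new IOException("Expected a number at sequence " + sequenceNum);
            }
            totalTrainingSequenceArray[sequenceNum][inputPosition] = tokenizer.nval;
        }
    }
} catch (IOException e) {
    e.printStackTrace();
}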
I've also seen the Files.lines() method mentioned as an alternative.
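For reference, this is roughly how I'd expect the Files.lines() version to look (assuming m_sequenceFile is a String path; with a File it would be m_sequenceFile.toPath(), and it needs java.nio.file.*, java.nio.charset.StandardCharsets, java.util.Arrays, and java.util.stream.Stream). Note that each line is still materialized as a String and split() still creates one String per number, so I wouldn't expect it to change the garbage profile:

try (Stream<String> lines = Files.lines(Paths.get(m_sequenceFile), StandardCharsets.UTF_8)) {
    totalTrainingSequenceArray = lines.skip(1) // Discard the header line.
            .map(l -> Arrays.stream(l.split("\\s+"))
                    .mapToDouble(Double::parseDouble)
                    .toArray())
            .toArray(double[][]::new);
} catch (IOException e) {
    e.printStackTrace();
}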