
I'm trying to normalize my dataset, which is 1.7 GB. I have 14 GB of RAM, and I hit my limit very quickly.

This happens when computing the mean/std of the training data. The training data takes up the majority of the memory once loaded into RAM (13.8 GB), so the mean gets calculated, but the script crashes on the next line, while calculating the std.

Here is the script:

import caffe
import leveldb
import numpy as np
from caffe.proto import caffe_pb2
import cv2
import sys
import time

direct = 'examples/svhn/'
db_train = leveldb.LevelDB(direct+'svhn_train_leveldb')
db_test = leveldb.LevelDB(direct+'svhn_test_leveldb')
datum = caffe_pb2.Datum()

# using the whole dataset for training, which is 604,388 images
size_train = 604388 #normal training set is 73257
size_test = 26032
data_train = np.zeros((size_train, 3, 32, 32))
label_train = np.zeros(size_train, dtype=int)

print 'Reading training data...'
i = -1
for key, value in db_train.RangeIter():
    i = i + 1
    if i % 1000 == 0:
        print i
    if i == size_train:
        break
    datum.ParseFromString(value)
    label = datum.label
    data = caffe.io.datum_to_array(datum)
    data_train[i] = data
    label_train[i] = label

print 'Computing statistics...'
print 'calculating mean...'
mean = np.mean(data_train, axis=(0,2,3))
print 'calculating std...'
std = np.std(data_train, axis=(0,2,3))

#np.savetxt('mean_svhn.txt', mean)
#np.savetxt('std_svhn.txt', std)

print 'Normalizing training'
for i in range(3):
    print i
    data_train[:, i, :, :] = data_train[:, i, :, :] - mean[i]
    data_train[:, i, :, :] = data_train[:, i, :, :] / std[i]


print 'Outputting training data'
leveldb_file = direct + 'svhn_train_leveldb_normalized'
batch_size = size_train

# create the leveldb file
db = leveldb.LevelDB(leveldb_file)
batch = leveldb.WriteBatch()
datum = caffe_pb2.Datum()

for i in range(size_train):
    if i % 1000 == 0:
        print i

    # save in datum
    datum = caffe.io.array_to_datum(data_train[i], label_train[i])
    keystr = '{:0>5d}'.format(i)
    batch.Put(keystr, datum.SerializeToString())

    # write batch
    if (i + 1) % batch_size == 0:
        db.Write(batch, sync=True)
        batch = leveldb.WriteBatch()
        print (i + 1)

# write last batch
if (i+1) % batch_size != 0:
    db.Write(batch, sync=True)
    print 'last batch'
    print (i + 1)
#explicitly freeing memory to avoid hitting the limit!
#del data_train
#del label_train

print 'Reading test data...'
data_test = np.zeros((size_test, 3, 32, 32))
label_test = np.zeros(size_test, dtype=int)
i = -1
for key, value in db_test.RangeIter():
    i = i + 1
    if i % 1000 == 0:
        print i
    if i == size_test:
        break
    datum.ParseFromString(value)
    label = datum.label
    data = caffe.io.datum_to_array(datum)
    data_test[i] = data
    label_test[i] = label

print 'Normalizing test'
for i in range(3):
    print i
    data_test[:, i, :, :] = data_test[:, i, :, :] - mean[i]
    data_test[:, i, :, :] = data_test[:, i, :, :] / std[i]

#Zero Padding
#print 'Padding...'
#npad = ((0,0), (0,0), (4,4), (4,4))
#data_train = np.pad(data_train, pad_width=npad, mode='constant', constant_values=0)
#data_test = np.pad(data_test, pad_width=npad, mode='constant', constant_values=0)

print 'Outputting test data'
leveldb_file = direct + 'svhn_test_leveldb_normalized'
batch_size = size_test

# create the leveldb file
db = leveldb.LevelDB(leveldb_file)
batch = leveldb.WriteBatch()
datum = caffe_pb2.Datum()

for i in range(size_test):
    # save in datum
    datum = caffe.io.array_to_datum(data_test[i], label_test[i])
    keystr = '{:0>5d}'.format(i)
    batch.Put(keystr, datum.SerializeToString())

    # write batch
    if (i + 1) % batch_size == 0:
        db.Write(batch, sync=True)
        batch = leveldb.WriteBatch()
        print (i + 1)

# write last batch
if (i+1) % batch_size != 0:
    db.Write(batch, sync=True)
    print 'last batch'
    print (i + 1)

How can I make it consume less memory so that the script can run to completion?

  • The larger the data you are trying to normalize, the more likely you are to run out of memory; reduce the data size. Commented Oct 6, 2016 at 10:00
  • If that means reducing the training set, that is not possible; I need to normalize the whole set. Commented Oct 6, 2016 at 10:18
  • You can read the data as a memory map (docs.scipy.org/doc/numpy/reference/generated/numpy.memmap.html), or write your own version of mean that sequentially reads one (or more) data points from the file and computes the mean/std. The mean is just the sum of all data divided by the count (pseudocode: sum_i x_i / N), so you don't need the whole dataset in memory to compute it. The same goes for the standard deviation: compute the mean first, then calculate sqrt(sum_i (x_i - x_mean)**2 / (N-1)). A sketch of this chunked approach follows these comments. Commented Oct 6, 2016 at 10:51
  • @ChristophTerasa: I tried your suggestion, but after deleting the pointer, which should flush the file to disk, nothing happens. A file does get created, with a size of 14.9 GB, but the next command never executes. This is the relevant code: fp = np.memmap(directory+'train_data_memmap', dtype=data_train.dtype, mode='w+', shape=data_train.shape); fp[:] = data_train[:]; del fp; raw_input('saved! check the file & press to continue'). The message does not get printed, and pressing any key does nothing. Commented Oct 6, 2016 at 12:34
  • @ChristophTerasa: does dividing the whole dataset into two halves, calculating the mean/std for each half, and then combining them, give the same mean/std as using the whole dataset at once? I'm thinking of reading half of the dataset, calculating its mean/std, then doing the same with the second half, and getting around the memory issue that way. Is this correct at all? Commented Oct 15, 2016 at 20:06
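
The streaming approach suggested in the comments can be sketched as a chunked accumulation pass: only per-channel sums and sums of squares are carried between chunks, so at most one chunk is ever resident in RAM. This is an illustrative sketch, not code from the post; read_chunk is a hypothetical helper that reads records [start, end) from the LevelDB the same way the question's RangeIter loop does and returns an (n, 3, 32, 32) float array.

import numpy as np

chunk_size = 10000            # illustrative value
channel_sum = np.zeros(3)     # per-channel running sum
channel_sq_sum = np.zeros(3)  # per-channel running sum of squares
n_values = 0                  # pixels seen per channel so far

for start in range(0, size_train, chunk_size):
    # read_chunk is a hypothetical helper, not part of the original script
    chunk = read_chunk(start, min(start + chunk_size, size_train))
    n_values += chunk.shape[0] * chunk.shape[2] * chunk.shape[3]
    channel_sum += chunk.sum(axis=(0, 2, 3))
    channel_sq_sum += (chunk ** 2).sum(axis=(0, 2, 3))

mean = channel_sum / n_values
# E[x**2] - E[x]**2, matching np.std's default (population) definition
std = np.sqrt(channel_sq_sum / n_values - mean ** 2)

Because only raw sums are carried between chunks, splitting the dataset in two (as asked in the last comment) gives exactly the same result, as long as the sums, not the finished per-half means/stds, are what get combined.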

2 Answers


Why not compute the statistics on a subset of the original data? For example, here we compute the mean and std for just 100 points:

sample_size = 100
data_train = np.random.rand(1000, 20, 10, 10)

# Take subset of training data
idxs = np.random.choice(data_train.shape[0], sample_size)
data_train_subset = data_train[idxs]

# Compute stats
mean = np.mean(data_train_subset, axis=(0,2,3))
std = np.std(data_train_subset, axis=(0,2,3))

If your data is 1.7 GB, it is highly unlikely that you need all of it to get an accurate estimate of the mean and std.

In addition, could you get away with fewer bits in your datatype? I'm not sure what datatype caffe.io.datum_to_array returns, but you could do:

data = caffe.io.datum_to_array(datum).astype(np.float32)

to ensure the data is in float32 format. (If the data is currently float64, this will save you half the space.)
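
To see what that buys for the array in the question, here is a quick back-of-envelope check using the sizes from the question:

# float64 vs float32 footprint for the (604388, 3, 32, 32) training array
n, c, h, w = 604388, 3, 32, 32
print n * c * h * w * 8 / float(2 ** 30)   # float64: ~13.8 GiB
print n * c * h * w * 4 / float(2 ** 30)   # float32: ~6.9 GiB

The float64 figure matches the 13.8 GB reported in the question; float32 would halve it to roughly 6.9 GB.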


3 Comments

Thanks; unfortunately it's float64, and since Caffe doesn't support float32 I have to use the default. And since I need to save the normalized training set, I need to load it into RAM anyway.
OK, that's a shame about float32. For the normalisation, though, my point is to compute the statistics using only the subset. Then you can normalise the whole dataset using those statistics. This saves you RAM during the computation of the mean and std, which I think is where your script is failing.
Thanks, I'll give it a try ;)

The culprit that caused so many issues and the constant crashing from insufficient memory was the batch size being set to the size of the whole training set:

print 'Outputting test data'
leveldb_file = direct + 'svhn_test_leveldb_normalized'
batch_size = size_test

This was apparently the cause: nothing would get committed and saved to disk until the whole dataset had been read and loaded into one huge transaction. This is also why using np.float32, as suggested by @BillCheatham, didn't work properly on its own.

The memory-map solution wouldn't work for me for some reason, so I used the solution I mentioned above.

PS: Later on, I switched completely to float32, fixed the batch_size, and ran the whole thing in one go; that's how I can say my former solution (dividing the dataset and combining the fractions) works and gives the same numbers to 2 decimal places.
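
For reference, a minimal sketch of the corrected write loop, reusing the names from my script above; batch_size = 1000 is just an illustrative value (anything far below size_train works):

batch_size = 1000  # illustrative; commit in small transactions
db = leveldb.LevelDB(leveldb_file)
batch = leveldb.WriteBatch()

for i in range(size_train):
    datum = caffe.io.array_to_datum(data_train[i], label_train[i])
    batch.Put('{:0>5d}'.format(i), datum.SerializeToString())
    # commit every batch_size records so no single transaction grows huge
    if (i + 1) % batch_size == 0:
        db.Write(batch, sync=True)
        batch = leveldb.WriteBatch()

# flush the remainder
if size_train % batch_size != 0:
    db.Write(batch, sync=True)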
