
I have a huge CSV file which cannot be loaded into memory. Transforming it to libsvm format may save some memory. There are many NaNs in the CSV file. If I read it line by line and store the values as an np.array, with np.nan for NULL, will the array still occupy too much memory? Do the np.nan entries in the array also occupy memory?

  • Does the np.nan in an array also occupy memory? A numpy array is a homogeneous fixed-size record data structure, i.e. the same amount of memory is allocated for each of its elements (e.g. 4 bytes for float32 and 8 bytes for float64). numpy.nan is simply represented by a special (reserved) bit pattern.
  • Numpy arrays are contiguous blocks of memory (assuming C ordering and no transpose). No matter what you store in one, it will occupy space equal to its shape times its data type's element size; see the sketch after these comments. Scipy has sparse matrices that you could use to avoid storing the NaNs.
  • You might find this question helpful, which constructs a sparse scipy matrix from a CSV.
  • scikit-learn does work with (lib)svm: scikit-learn.org/stable/modules/svm.html. But you'll need to read its docs to see whether that helps with your memory issues.
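A minimal sketch of the point made in the comments above (plain NumPy; the shape is made up for illustration): an array's memory footprint depends only on its shape and dtype, not on how many of its elements are NaN.

import numpy as np

# Two float64 arrays of the same shape: one all NaN, one all ones.
a = np.full((1000, 100), np.nan)
b = np.ones((1000, 100))

# Both buffers take 1000 * 100 * 8 = 800000 bytes.
print(a.nbytes)          # 800000
print(b.nbytes)          # 800000
print(a.dtype.itemsize)  # 8 bytes per element, NaN or not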

3 Answers


When working with floating-point representations of numbers, non-numeric values (NaN and inf) are also represented by specific bit patterns that occupy the same number of bits as any numeric floating-point value. Therefore, NaNs occupy exactly the same amount of memory as any other number in the array.
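To make that concrete, here is a short sketch (float64 assumed; the values are arbitrary) that views each float's bytes as a 64-bit integer, showing that NaN and inf are just reserved bit patterns of the same width as any other value:

import numpy as np

x = np.array([1.0, np.nan, np.inf])

# Reinterpret the same buffer as unsigned 64-bit integers to expose the bit patterns.
bits = x.view(np.uint64)
for value, pattern in zip(x, bits):
    print(f"{value!s:>5} -> 0x{pattern:016x}")

#   1.0 -> 0x3ff0000000000000
#   nan -> 0x7ff8000000000000
#   inf -> 0x7ff0000000000000

Every entry, NaN included, occupies x.dtype.itemsize == 8 bytes.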



As far as I know, yes: NaN and zero values occupy the same memory as any other value. However, you can address your problem in other ways:

Have you tried using a sparse vector? They are intended for vectors with many zero values, and their memory consumption is optimized for that case; see the sketch after the links below.

SVM Module Scipy

Sparse matrices Scipy

There you have some info about SVMs and sparse matrices; if you have further questions, just ask.

Edited to provide an answer as well as a solution
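A minimal sketch of the sparse approach (scipy.sparse; the shape and fill pattern are made up for illustration), building a CSR matrix and comparing its memory use with the dense equivalent. Note that sparse formats compress zeros, not NaNs, so missing values would first have to be mapped to zero or encoded some other way.

import numpy as np
from scipy import sparse

# Dense stand-in for the real data: mostly zeros.
dense = np.zeros((1000, 1000))
dense[::50, ::50] = 1.0

# CSR stores only the nonzero values plus two index arrays.
csr = sparse.csr_matrix(dense)

print(dense.nbytes)  # 8000000 bytes for the dense array
print(csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes)  # ~9 KB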

4 Comments

Which package should I use? Scipy? Any examples? Thanks.
I am not sure sparse vectors will work with xgboost, because my goal is to train a model on it.
Do not use the sparse matrix code unless your learning/training code explicitly says you can. Some scikit-learn methods do (see the sketch after these comments).
I have sometimes used it to train an SVM with scipy; if you are interested, I can look for my code and post it.
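As a hedged illustration of the comment above (the toy data here is made up; scikit-learn's SVC is documented to accept scipy.sparse input):

import numpy as np
from scipy import sparse
from sklearn.svm import SVC

# Toy training set: 4 samples, 3 features, mostly zeros.
X = sparse.csr_matrix([[1.0, 0.0, 0.0],
                       [2.0, 0.0, 0.0],
                       [0.0, 0.0, 1.0],
                       [0.0, 0.0, 2.0]])
y = np.array([1, 1, 0, 0])

# fit() accepts the sparse matrix directly; no densification needed.
clf = SVC(kernel="linear").fit(X, y)
print(clf.predict(X))  # predictions for the training samples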

According to the getsizeof() function from the sys module, it does. A simple and fast example:

import sys
import numpy as np

x = np.array([1, 2, 3])       # dtype is inferred as the platform's default integer
y = np.array([1, np.nan, 3])  # np.nan forces a float64 dtype

x_size = sys.getsizeof(x)
y_size = sys.getsizeof(y)
print(x_size)
print(y_size)
print(y_size == x_size)

This should print out (on a machine whose default integer dtype is 64-bit; see the comments below):

 120
 120 
 True 

So my conclusion is that a NaN uses as much memory as a normal entry.

Instead you could use sparse matrices (scipy.sparse), which do not store zero/null entries at all and are therefore more memory-efficient. But SciPy strongly discourages applying NumPy functions directly to sparse matrices (https://docs.scipy.org/doc/scipy/reference/sparse.html), since NumPy might not interpret them correctly.
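A small sketch of that caveat (toy matrix only): prefer the sparse matrix's own methods, or convert to a dense ndarray explicitly before handing the data to NumPy.

import numpy as np
from scipy import sparse

A = sparse.csr_matrix([[0.0, 1.0], [2.0, 0.0]])

# Use the sparse object's own methods...
print(A.sum())  # 3.0
print(A.max())  # 2.0

# ...or densify explicitly before calling NumPy functions.
print(np.sum(A.toarray()))  # 3.0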

2 Comments

On my machine, this prints 108, 120, False, because x.dtype == np.int32. To make this a useful example, you should use 1.0, 2.0, 3.0, which will make the arrays have the same type
Okay, sorry. I didn't know that there might be a difference between machines for that example. But to be fair, on my machine it works like that. Furthermore, x.dtype == np.int64 and, analogously, y.dtype == np.float64 in my case.
