
As the title says, I'm seeing a big difference between the memory usage of a numpy array between Windows and Ubuntu.

Here's a simple code to replicate this issue:

import numpy as np
import joblib

a = [1]*1000
b = [a for i in range(1000)]
np_arr = np.array(b)

joblib.dump(np_arr, 'arr.h5')

If I run this code in a Windows 10 machine, arr.h5's size is 3907KB.

But if I run it on Ubuntu 18.04, the size is 7812KB.

The main issue is that I'm dealing with large datasets: my code runs fine on a Windows machine with 16GB of RAM, but I'm hitting MemoryErrors on Ubuntu with 32GB.
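A quick way to see where the extra memory goes is to inspect the array's dtype and byte count directly (a minimal sketch of the check; the printed values depend on the platform):

```python
import numpy as np

a = [1] * 1000
b = [a for i in range(1000)]
np_arr = np.array(b)

# On 64-bit Linux the default integer dtype is typically int64 (8 bytes
# per element), while on 64-bit Windows it is int32 (4 bytes), so the
# in-memory size of the same array differs by a factor of two.
print(np_arr.dtype)   # int32 on Windows, int64 on Linux
print(np_arr.nbytes)  # 4000000 on Windows, 8000000 on Linux
```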

Comments:
  • Is it possible that your Windows Python is compiled as a 32-bit application? Commented Sep 6, 2019 at 1:57
  • Both are 64-bit Python installations Commented Sep 6, 2019 at 2:13

1 Answer


Yep this is a difference between Windows and Linux...

The default integer type in numpy is np.int_, which maps to a C long (see the docs). The C standard doesn't specify the exact size of a long, only that it's at least 32 bits (4 bytes); the actual size depends on the compiler and CPU architecture. There's already a discussion of this issue on the numpy bug tracker.
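You can check what np.int_ resolves to on your machine (a minimal sketch; the result reflects the platform's C data model, e.g. LLP64 on 64-bit Windows vs LP64 on 64-bit Linux):

```python
import numpy as np

# np.int_ follows the platform's C long: 4 bytes on 64-bit Windows
# (LLP64 data model), 8 bytes on 64-bit Linux (LP64).
print(np.dtype(np.int_).itemsize)  # 4 on Windows, 8 on Linux
```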

The problem can be avoided by explicitly setting the integer type:

np_arr = np.array(b, dtype=np.int32)

If you know the smallest and largest values your array will hold, you can potentially get away with an even smaller integer type, such as int16 or uint8.
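np.iinfo is handy for picking the smallest type that still fits your data (a small illustration; the 1000x1000 array here mirrors the one in the question):

```python
import numpy as np

# np.iinfo reports the representable range of an integer dtype, which
# helps choose the smallest type that still fits the data.
print(np.iinfo(np.int16).min, np.iinfo(np.int16).max)  # -32768 32767
print(np.iinfo(np.uint8).min, np.iinfo(np.uint8).max)  # 0 255

# A 1000x1000 array of small values fits easily in int16,
# using 2 bytes per element on every platform:
arr = np.ones((1000, 1000), dtype=np.int16)
print(arr.nbytes)  # 2000000
```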


