
I am using scipy.sparse in my application and want to do some performance tests. In order to do that, I need to create a large sparse matrix (which I will then use in my application). As long as the matrix is small, I can create it using the command

import scipy.sparse as sp
a = sp.rand(1000,1000,0.01)

This results in a 1000 by 1000 matrix with 10,000 nonzero entries (a reasonable density, meaning approximately 10 nonzero entries per row).

The problem arises when I try to create a larger matrix, for example a 100,000 by 100,000 matrix (I have dealt with far larger matrices before). I run

import scipy.sparse as sp
N = 100000
d = 0.0001
a = sp.rand(N, N, d)

which should result in a 100,000 by 100,000 matrix with one million nonzero entries (well within the realm of the possible), but instead I get an error message:

Traceback (most recent call last):
  File "<pyshell#6>", line 1, in <module>
    sp.rand(100000,100000,0.0000001)
  File "C:\Python27\lib\site-packages\scipy\sparse\construct.py", line 723, in rand
    j = random_state.randint(mn)
  File "mtrand.pyx", line 935, in mtrand.RandomState.randint (numpy\random\mtrand\mtrand.c:10327)
OverflowError: Python int too large to convert to C long

This is an internal scipy error that I cannot work around.


I understand that I can create a 10n by 10n matrix by creating one hundred n by n matrices and stacking them together. However, I think scipy.sparse should be able to handle the creation of large sparse matrices directly (again, 100k by 100k is by no means large, and scipy comfortably handles matrices with several million rows). Am I missing something?
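For what it's worth, the block-stacking workaround mentioned above can be done with scipy.sparse.bmat. A minimal sketch (the block size n and density are arbitrary choices for illustration, not values from the question):

```python
import scipy.sparse as sp

# Build a 10n-by-10n sparse matrix from one hundred n-by-n random
# blocks, then stack them into one matrix with bmat.
n = 1000          # block size; 10 blocks per side gives 10000 x 10000
density = 0.01
blocks = [[sp.rand(n, n, density, format='csr') for _ in range(10)]
          for _ in range(10)]
a = sp.bmat(blocks, format='csr')
print(a.shape)    # (10000, 10000)
```

Each call to sp.rand stays well under the problematic index range, so only the stacking step has to deal with the full dimensions.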

  • This is probably because it's picking the random entries for your matrix by selecting a 32-bit int between 0 and N*M, and the max 32-bit (signed) int is 2^31-1 (100,000*100,000 = 10,000,000,000 > 2,147,483,647 = 2^31-1). Building it in blocks using bmat is probably the easiest workaround. Try making N*M = 2^31-2 and then 2^31 and see if that causes the problem to pop up. Commented Feb 24, 2015 at 10:59
  • I can't edit that previous comment anymore, but that error is consistent with what i describe: Python int too large to convert to C long and the limits in the climits header. Commented Feb 24, 2015 at 11:07
  • This probably occurs only on 32-bit Python, which is probably why the bug wasn't noticed earlier. Commented Feb 24, 2015 at 11:12
  • as Jan-Philip Gehrcke points out below, it is system dependent - I think you should be able to have a look in stdint.h on your system though and see what your limits are. Commented Feb 24, 2015 at 11:22
  • 1
    I opened an issue for the wrong error message Commented Feb 25, 2015 at 0:48
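As a side note to the comments above: rather than digging through stdint.h, the platform's native C "long" limit can be checked from Python via NumPy. A small sketch, assuming a NumPy build (as in the era of this question) where np.int_ maps to the C long:

```python
import numpy as np

# np.int_ maps to the platform's C "long" on older NumPy builds:
# max is 2**63 - 1 on 64-bit Linux, but only 2**31 - 1 on Windows,
# where "long" stays 32 bits even on 64-bit systems.
limit = np.iinfo(np.int_).max
print(limit)

# The failing call effectively needs an index up to N*M:
N = M = 100000
print(N * M > 2**31 - 1)  # True: 10^10 overflows a 32-bit signed long
```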

1 Answer


Without getting to the bottom of the issue, make sure that you are using a 64-bit build on a 64-bit architecture on a Linux platform. There, the native "long" data type is 64 bits in size (as opposed to Windows, I believe).

For reference, see these tables:

Edit: Maybe I was not explicit enough before -- on 64-bit Windows, the classical native "long" data type is 32 bits in size (also see this question). This might be the problem in your case; that is, your code might just work when you change platform to Linux. I cannot say this with absolute certainty, because it really depends on which native data types are used in the numpy/scipy C source (of course 64-bit data types are available on Windows; usually a platform case analysis is performed with compiler directives, and proper types are chosen via macros -- I cannot really imagine that they used 32-bit data types by accident).

Edit 2:

I can provide three data samples supporting my hypothesis.

Debian 64 bit, Python 2.7.3 and SciPy 0.10.1 binaries from Debian repos:

Python 2.7.3 (default, Mar 13 2014, 11:03:55)
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import scipy; print scipy.__version__; import scipy.sparse as s; s.rand(100000, 100000, 0.0001).shape
0.10.1
(100000, 100000)

Windows 7 64 bit, 32 bit Python build, 32 bit SciPy 0.10.1 build, both from ActivePython:

ActivePython 2.7.5.6 (ActiveState Software Inc.) based on
Python 2.7.5 (default, Sep 16 2013, 23:16:52) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import scipy; print scipy.__version__; import scipy.sparse as s; s.rand(100000, 100000, 0.0001).shape
0.10.1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\user\AppData\Roaming\Python\Python27\site-packages\scipy\sparse\construct.py", line 426, in rand
    raise ValueError(msg % np.iinfo(tp).max)
ValueError: Trying to generate a random sparse matrix such as the product of dimensions is
greater than 2147483647 - this is not supported on this machine

Windows 7 64 bit, 64 bit ActivePython build, 64 bit SciPy 0.15.1 build (from Gohlke, built against MKL):

ActivePython 3.4.1.0 (ActiveState Software Inc.) based on
Python 3.4.1 (default, Aug  7 2014, 13:09:27) [MSC v.1600 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import scipy; scipy.__version__; import scipy.sparse as s; s.rand(100000, 100000, 0.0001).shape
'0.15.1'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python34\lib\site-packages\scipy\sparse\construct.py", line 723, in rand
    j = random_state.randint(mn)
  File "mtrand.pyx", line 935, in mtrand.RandomState.randint (numpy\random\mtrand\mtrand.c:10327)
OverflowError: Python int too large to convert to C long
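Not part of the original answer, but a platform-independent sketch that sidesteps the internal index draw in sparse.rand: generate the row/column indices yourself with explicit 64-bit integers and feed them to coo_matrix. Note that, unlike sp.rand, sampling with replacement can place two values at the same coordinate:

```python
import numpy as np
import scipy.sparse as sp

# Draw coordinates with explicit 64-bit integers instead of letting
# sparse.rand pick a flat index that may overflow a 32-bit C long.
N = 100000
nnz = 1000000  # one million nonzeros, i.e. density 1e-4

rng = np.random.RandomState(0)
rows = rng.randint(0, N, size=nnz).astype(np.int64)
cols = rng.randint(0, N, size=nnz).astype(np.int64)
vals = rng.rand(nnz)
a = sp.coo_matrix((vals, (rows, cols)), shape=(N, N))
print(a.shape, a.nnz)  # (100000, 100000) 1000000
```

Duplicate coordinates are summed only when the matrix is converted to CSR/CSC, so for a rough performance benchmark this should behave like the matrix sp.rand would have produced.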


I am using a 64-bit build on 64-bit Python on a 64-bit Windows 7 platform.
As I do not have a Linux platform to test your assumption on, I can only assume that you are correct.
Also, there are no official 64 bit builds of numpy available for Windows -- what did you install, actually? Did you use lfd.uci.edu/~gohlke/pythonlibs/#numpy?
Yes, I used the unofficial binary. It worked well for me in the past.
Gohlke's builds are created with Intel's compiler suite. It could be that this data type "confusion" is a weakness of these compilers. I am not sure which compilers others (third party Python distributions) are using, but maybe you want to try Enthought or ActiveState or Anaconda Python. They all bring their own builds of NumPy. It could be that one of them does not suffer from what you observe.
