
I want to create a scipy array from a really huge list. But unfortunately I stumbled across a problem.

I have a list xs of strings; each string has length 1.

>>> type(xs)
<type 'list'>
>>> len(xs)
4001844816

If I convert only the first 10 elements, everything still works as expected.

>>> s = xs[0:10]
>>> x = scipy.array(s)
>>> x
array(['A', 'B', 'C', 'D', 'E', 'F', 'O', 'O'],
      dtype='|S1')
>>> len(x)
10

For the whole list I get this result:

>>> ary = scipy.array(xs)
>>> ary.size
1
>>> ary.shape
()
>>> ary[0]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: 0-d arrays can't be indexed
>>> ary[()]
... (prints the whole list)

A workaround would be:

# Pre-allocate an array of one-character strings and fill it element by element.
test = scipy.zeros(len(xs), dtype=(str, 1))
for i in xrange(len(xs)):
    test[i] = xs[i]

It is not a problem of insufficient memory. For now I am using the workaround (which takes about 15 minutes), but I would like to understand the problem.
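
A faster variant of the workaround might be to convert the list in slices instead of single elements, so that no single scipy.array call sees the whole list at once (just a sketch, not tested at the full size; the chunk size is an arbitrary choice of mine):

import scipy

def list_to_chararray(xs, chunk=10**8):
    # Sketch only: 'chunk' is an arbitrary slice size picked for illustration.
    # Each slice is converted with one array() call and copied into place.
    out = scipy.zeros(len(xs), dtype=(str, 1))
    for start in xrange(0, len(xs), chunk):
        out[start:start + chunk] = scipy.array(xs[start:start + chunk])
    return out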

Thank you

-- Edit: A remark on the workaround: test[:] = xs does not work either (it also fails with the 0-d IndexError).

On my MacBook, 2147483648 was the smallest size causing the problem. I determined it with this small script:

#!/usr/bin/python
# Shrink the list one element at a time until scipy.array() stops
# collapsing it into a 0-d array.
import scipy as sp

startlen = 2147844816

xs = ["A"] * startlen
ary = sp.array(xs)
while ary.shape == ():
    print "bad", len(xs)
    xs.pop()
    ary = sp.array(xs)

print "good", len(xs)
print ary.shape, ary[0:10]
print "DONE."

This was the output:

...
bad 2147483649
bad 2147483648
good 2147483647
(2147483647,) ['A' 'A' 'A' 'A' 'A' 'A' 'A' 'A' 'A' 'A']
DONE.
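
For reference, the same threshold could also be found with far fewer array constructions by bisecting over the list length instead of popping single elements (a sketch of the idea, not the script I actually ran):

import scipy as sp

def largest_good_size(xs, lo, hi):
    # Assumes sp.array(xs[:lo]) works and sp.array(xs[:hi]) collapses to
    # a 0-d array; narrows the gap between them by bisection.
    # Each xs[:mid] slice copies list pointers, but only O(log n) arrays are built.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if sp.array(xs[:mid]).shape == ():
            hi = mid
        else:
            lo = mid
    return lo

# e.g. largest_good_size(xs, 10, len(xs))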

The Python version is:

>>> sys.version
'2.7.5 (default, Aug 25 2013, 00:04:04) \n[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)]'
>>> scipy.version.version
'0.11.0'
  • Sorry, can't help since I can't reproduce. Can you find out what the smallest xs subset size causing the error is? Commented Oct 30, 2013 at 21:50
  • @alko: I added info to question Commented Nov 3, 2013 at 10:03
  • Since 2147483648 = 2**31, I am pretty sure this is a memory allocation/addressing limitation in scipy, probably due to signed int32 usage. I recommend reporting this error to the scipy issue tracker. Commented Nov 3, 2013 at 10:18
  • What does import platform; platform.architecture() return? Saw that tip from @JoeKington Commented Nov 4, 2013 at 13:36
  • Yes, they are both 64-bit (the two machines on which I encountered the problem). Commented Nov 4, 2013 at 22:29

1 Answer


Assuming you have a 64-bit OS/Python/NumPy, you might be seeing some manifestation of an out-of-memory condition, which can show up in unusual ways. Your first list is about 4 GB, and then you allocate an additional 4 GB for the numpy array. Even for x64 those are big arrays. Have you seen memmap arrays before?

What I have done below is create a series of memmap arrays to show where (for my machine) the breaking points are (primarily disk I/O). Decent array sizes can still be created: 30 billion 'S1' elements. This code might help you see whether a memmap array can provide some benefit for your problem; they are easy to work with, and your 15-minute workaround could be sped up using them.

import numpy
from numpy import arange

baseNumber = 3000000L
#dataType = 'float64'
numBytes = 1
dataType = 'S1'
for powers in arange(1, 7):
    l1 = baseNumber * 10**powers
    print('working with %d elements' % (l1))
    print('number bytes required %f GB' % (l1 * numBytes / 1e9))
    try:
        # Create a disk-backed array of l1 one-byte string elements.
        fp = numpy.memmap('testa.map', dtype=dataType, mode='w+', shape=(1, l1))
        print('works')
        del fp
    except Exception as e:
        print(repr(e))


"""
working with 30000000 elements
number bytes required 0.030000 GB
works
working with 300000000 elements
number bytes required 0.300000 GB
works
working with 3000000000 elements
number bytes required 3.000000 GB
works
working with 30000000000 elements
number bytes required 30.000000 GB
works
working with 300000000000 elements
number bytes required 300.000000 GB
IOError(28, 'No space left on device')
working with 3000000000000 elements
number bytes required 3000.000000 GB
IOError(28, 'No space left on device')


"""

3 Comments

Interesting. Out of memory on my notebook could be possible, although I do not think the other machine (at the university lab, with tons of memory) has the same problem. (I will check next time whether it has the same 'smallest size causing the problem'.)
Right, that would make sense. You might be able to use memmap without problems then.
I have just checked it. The server has the same limit. Unfortunately, for my specific application the memmap is slower than my workaround. But it is good to know that memmaps exist; perhaps I will need them someday.
