I'm using pandas to manage a large array of 8-bit integers. These integers are included as space-delimited elements of a column in a comma-delimited CSV file, and the array size is about 10000x10000.

Pandas is able to quickly read the comma-delimited data from the first few columns as a DataFrame, and also quickly store the space-delimited strings in another DataFrame with minimal hassle. The trouble comes when I try to transform the table from a single column of space-delimited strings to a DataFrame of 8-bit integers.

I have tried the following:

intdata = pd.DataFrame(strdata.columnname.str.split().tolist(), dtype='uint8')

But the memory usage is unbearable - 10MB worth of integers consumes 2GB of memory. I'm told that it's a limitation of the language and there's nothing I can do about it in this case.

As a possible workaround, I was advised to save the string data to a CSV file and then reload the CSV file as a DataFrame of space-delimited integers. This works well, but to avoid the slowdown that comes from writing to disk, I tried writing to a StringIO object.

Here's a minimal non-working example:

import numpy as np
import pandas as pd
from cStringIO import StringIO

a = np.random.randint(0,256,(10000,10000)).astype('uint8')
b = pd.DataFrame(a)
c = StringIO()
b.to_csv(c, delimiter=' ', header=False, index=False)
d = pd.io.parsers.read_csv(c, delimiter=' ', header=None, dtype='uint8')

Which yields the following error message:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 443, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 228, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 533, in __init__
    self._make_engine(self.engine)
  File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 670, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 1032, in __init__
    self._reader = _parser.TextReader(src, **kwds)
  File "parser.pyx", line 486, in pandas.parser.TextReader.__cinit__ (pandas/parser.c:4494)
ValueError: No columns to parse from file

Which is puzzling, because if I run the exact same code with 'c.csv' instead of c, the code works perfectly. Also, if I use the following snippet:

with open('c.csv', 'w') as f:
    f.write(c.getvalue())

The CSV file gets saved without any problems, so writing to the StringIO object is not the issue.

It is possible that I need to replace c with c.getvalue() in the read_csv line, but when I do that, the interpreter tries to print the contents of c in the terminal! Surely there is a way to work around this.

Thanks in advance for the help.

1 Answer

There are two issues here, one fundamental and one you simply haven't come across yet. :^)

First, after you write to c, you're at the end of the (virtual) file. You need to seek back to the start. We'll use a smaller grid as an example:

>>> a = np.random.randint(0,256,(10,10)).astype('uint8')
>>> b = pd.DataFrame(a)
>>> c = StringIO()
>>> b.to_csv(c, delimiter=' ', header=False, index=False)
>>> next(c)
Traceback (most recent call last):
  File "<ipython-input-57-73b012f9653f>", line 1, in <module>
    next(c)
StopIteration

The StopIteration shows that there is nothing left to read from c, which is why read_csv raises the "no columns" error. If we seek first, though:

>>> c.seek(0)
>>> next(c)
'103,3,171,239,150,35,224,190,225,57\n'
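The same position behaviour can be seen without pandas. This is a minimal sketch that assumes Python 3's io.StringIO in place of the Python 2 cStringIO used in the question; the buffer contents are made up for illustration.

```python
import io

buf = io.StringIO()
buf.write("1 2 3\n4 5 6\n")

# After writing, the position sits at the end of the buffer,
# so a read from here finds nothing left.
end_pos = buf.tell()
leftover = buf.read()

# Rewinding to the start makes the full contents readable again.
buf.seek(0)
contents = buf.read()

print(repr(leftover))
print(repr(contents))
```

Any reader handed the buffer before the seek(0) call, including read_csv, starts from that end position.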

The line returned by next(c) reveals the second issue: commas, even though we requested space delimiters. That's because to_csv only accepts sep, not delimiter. It seems to me it should either accept delimiter or object to it; silently ignoring it (at least in this pandas version) feels like a bug. Anyway, if we use sep when writing (when reading, read_csv also accepts delimiter or delim_whitespace=True):

>>> a = np.random.randint(0,256,(10,10)).astype('uint8')
>>> b = pd.DataFrame(a)
>>> c = StringIO()
>>> b.to_csv(c, sep=' ', header=False, index=False)
>>> c.seek(0)
>>> d = pd.read_csv(c, sep=' ', header=None, dtype='uint8')
>>> d
     0    1    2    3    4    5    6    7    8    9
0  209   65  218  242  178  213  187   63  137  145
1  161  222   50   92  157   31   49   62  218   30
2  182  255  146  249  115   91  160   53  200  252
3  192  116   87   85  164   46  192  228  104  113
4   89  137  142  188  183  199  106  128  110    1
5  208  140  116   50   66  208  116   72  158  169
6   50  221   82  235   16   31  222    9   95  111
7   88   36  204   96  186  205  210  223   22  235
8  136  221   98  191   31  174   83  208  226  150
9   62   93  168  181   26  128  116   92   68  153
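Applied to the original problem, the round trip through to_csv may not be needed at all, because the column already holds space-delimited text. The following is a sketch under assumptions: it uses Python 3's io.StringIO, the small strdata frame is a hypothetical stand-in for the real data, and the memory behaviour on a 10000x10000 table has not been measured here.

```python
import io

import pandas as pd

# Hypothetical stand-in for the question's column of space-delimited strings.
strdata = pd.DataFrame({"columnname": ["1 2 3", "40 50 60", "255 0 7"]})

# Join the rows into one newline-separated block of text and wrap it in an
# in-memory buffer. A StringIO created from an initial value starts at
# position 0, so no seek is needed before reading.
buf = io.StringIO("\n".join(strdata["columnname"]))

# Let the CSV parser convert the text straight to uint8 rather than going
# through a nested Python list of per-cell strings.
intdata = pd.read_csv(buf, sep=" ", header=None, dtype="uint8")

print(intdata.dtypes.unique())
print(intdata)
```

This also sidesteps the c.getvalue() problem from the question: read_csv treats a plain string argument as a path rather than as data, so the text has to be wrapped in a file-like object.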