
I am trying to implement the solution given in this answer to read my ~3.3GB ASCII file into an ndarray.

However, I get a MemoryError when using this function against my file:

def iter_loadtxt(filename, delimiter=None, skiprows=0, dtype=float):
    def iter_func():
        with open(filename, 'r') as infile:
            for _ in range(skiprows):
                next(infile)
            for line in infile:
                line = line.rstrip().split(delimiter)
                for item in line:
                    yield dtype(item)
        iter_loadtxt.rowlength = len(line)

    data = np.fromiter(iter_func(), dtype=dtype)
    data = data.reshape((-1, iter_loadtxt.rowlength))
    return data

data = iter_loadtxt(fname,skiprows=1)

I am now trying to pass a different dtype to np.fromiter, hoping that, since most of my columns are integers rather than floats, the smaller array will avoid the memory issue, but so far without success.

My file is "many rows" x 7 cols, and I'd like to specify the following formats: float for the first three cols, and uint for the remaining four. My OS is Windows 10 64-bit, I have 8GB of RAM, and I am using Python 2.7 32-bit.

My attempt was (following this answer):

data = np.fromiter(iter_func(), dtype=[('',np.float),('',np.float),('',np.float),('',np.int),('',np.int),('',np.int),('',np.int)])

but I receive TypeError: expected a readable buffer object

EDIT1

Thanks to hpaulj who provided the solution. Below is the working code.

def iter_loadtxt(filename, delimiter=None, skiprows=0):
    def iter_func():
        # Per-column converters: three floats followed by four ints.
        dtypes = [float, float, float, int, int, int, int]
        with open(filename, 'r') as infile:
            for _ in range(skiprows):
                next(infile)
            for line in infile:
                fields = line.rstrip().split(delimiter)
                values = [t(v) for t, v in zip(dtypes, fields)]
                # Yield one tuple per row, as the compound dtype expects.
                yield tuple(values)

    data = np.fromiter(iter_func(), dtype=[('',np.float),('',np.float),('',np.float),('',np.int),('',np.int),('',np.int),('',np.int)])

    return data

data = iter_loadtxt(fname,skiprows=1)
  • Your very first step should be to stop using 32-bit Python and use 64-bit Python instead. This will unlock the rest of the memory on your machine. Commented Dec 1, 2016 at 12:09
  • Did you test this on a small file? The iter_func produces a stream of floats, without any grouping by line. I doubt that fromiter can handle a compound dtype this way. Commented Dec 1, 2016 at 15:37
  • @JohnZwinck Indeed. The 64-bit version of Python let me process the whole file. Thanks. Commented Dec 6, 2016 at 16:00

1 Answer


With a big enough input file, any code, however streamlined, can hit a memory error.

With all floats, each row of your 7-column array occupies 56 bytes; with the mixed dtype, 40. Not a dramatic change: if the code previously hit the memory error a third of the way through the file, it will now hit it (in theory) about halfway through.
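Those per-row figures can be checked directly with np.dtype.itemsize (a quick sketch; 'i4' is an assumption matching np.int on the asker's Windows build, where a C long is 32-bit):

```python
import numpy as np

# All seven columns as 64-bit floats: 7 * 8 = 56 bytes per row.
all_floats = np.dtype('f8,f8,f8,f8,f8,f8,f8')

# Mixed: three 8-byte floats plus four 4-byte ints: 24 + 16 = 40 bytes.
mixed = np.dtype('f8,f8,f8,i4,i4,i4,i4')

print(all_floats.itemsize, mixed.itemsize)  # 56 40
```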

iter_func reads the file and feeds a steady stream of floats (its own dtype). It does not return floats grouped by line. It records the row length, which is used at the end to reshape the 1d array.

fromiter can handle a compound dtype, but only if you feed it appropriately sized tuples.

In [342]: np.fromiter([(1,2),(3,4),(5,6)],dtype=np.dtype('i,i'))
Out[342]: 
array([(1, 2), (3, 4), (5, 6)], 
      dtype=[('f0', '<i4'), ('f1', '<i4')])

In [343]: np.fromiter([1,2,3,4],dtype=np.dtype('i,i'))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-343-d0fc5f822886> in <module>()
----> 1 np.fromiter([1,2,3,4],dtype=np.dtype('i,i'))

TypeError: a bytes-like object is required, not 'int'

Changing iter_func to something like this might work (not tested):

def iter_func():
    dtypes=[float,float,float,int,int,int,int]
    with open(filename, 'r') as infile:
        for _ in range(skiprows):
            next(infile)
        for line in infile:
            line = line.rstrip().split(delimiter)
            values = [t(v) for t,v in zip(dtypes, line)]
            yield tuple(values)
arr = np.fromiter(iter_func(), dtype=[('',np.float),('',np.float),('',np.float),('',np.int),('',np.int),('',np.int),('',np.int)] )
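As a quick self-test of the pattern above (not from the original thread; the sample file, values, and the 'f8,f8,f8,i8,i8,i8,i8' dtype string are made up for the demo):

```python
import os
import tempfile
import numpy as np

def iter_loadtxt(filename, delimiter=None, skiprows=0):
    """Stream a delimited text file into a structured array."""
    dtypes = [float, float, float, int, int, int, int]

    def iter_func():
        with open(filename, 'r') as infile:
            for _ in range(skiprows):
                next(infile)
            for line in infile:
                fields = line.rstrip().split(delimiter)
                # One tuple per row, as fromiter's compound dtype expects.
                yield tuple(t(v) for t, v in zip(dtypes, fields))

    return np.fromiter(iter_func(), dtype='f8,f8,f8,i8,i8,i8,i8')

# Two-row sample file with a header line to skip.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as f:
    f.write("x y z a b c d\n")
    f.write("1.0 2.0 3.0 4 5 6 7\n")
    f.write("8.5 9.5 10.5 11 12 13 14\n")
    path = f.name

data = iter_loadtxt(path, skiprows=1)
os.remove(path)

print(data.shape)    # (2,)
print(data['f3'])    # field f3 holds the fourth column: 4 and 11
```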

Comments

Thanks, I will try this out. However, I installed a 64-bit version of Python and made it through to the end of the process (although it took a huge amount of time!). I think this would be a good way to lower the computing time consistently, in line with your previous answer (which I linked in my own question). Once I verify that this is the case, and that your new code works, I will be happy to accept your answer.
What am I supposed to use as the dtype argument in arr = np.fromiter(iter_func(), dtype=...)? I am not quite sure I understand it... Sorry for asking, but it's not that clear from the function's documentation page.
I defined the dtypes list in iter_func to match the dtype parameter you tried in your question - the one with a mix of floats and ints.
Sorry to bother you again. What is not clear to me is what I am supposed to provide as the dtype parameter in the fromiter function (instead of the ...). From your comments and answer, it seems I have to provide [float,float,float,int,int,int,int], which is, however, already defined within iter_func, but I am probably wrong. I will add the improved code to the body of my question so that you can easily follow what I am doing.
See my edit - I copied the dtype from your original post.
