
I am trying to implement the solution given in this answer to read my ~3.3GB ASCII file into an ndarray.

However, I get a MemoryError when using this function against my file:

def iter_loadtxt(filename, delimiter=None, skiprows=0, dtype=float):
    def iter_func():
        with open(filename, 'r') as infile:
            for _ in range(skiprows):
                next(infile)
            for line in infile:
                line = line.rstrip().split(delimiter)
                for item in line:
                    yield dtype(item)
        iter_loadtxt.rowlength = len(line)

    data = np.fromiter(iter_func(), dtype=dtype)
    data = data.reshape((-1, iter_loadtxt.rowlength))
    return data

data = iter_loadtxt(fname,skiprows=1)

I am now trying to pass a different dtype to np.fromiter, hoping that, since most of my columns are integers rather than floats, the smaller array will avoid the memory issue, but so far without success.

My file is "many rows" x 7 cols, and I'd like to specify the following formats: float for the first three cols, and uint for the remaining four. My OS is Windows 10 64-bit, I have 8GB of RAM, and I am using Python 2.7 32-bit.

My attempt was (following this answer):

data = np.fromiter(iter_func(), dtype=[('',np.float),('',np.float),('',np.float),('',np.int),('',np.int),('',np.int),('',np.int)])

but I receive TypeError: expected a readable buffer object

EDIT1

Thanks to hpaulj who provided the solution. Below is the working code.

def iter_loadtxt(filename, delimiter=None, skiprows=0):
    def iter_func():
        # Per-column converters: three floats followed by four ints.
        dtypes = [float, float, float, int, int, int, int]
        with open(filename, 'r') as infile:
            for _ in range(skiprows):
                next(infile)
            for line in infile:
                fields = line.rstrip().split(delimiter)
                values = [t(v) for t, v in zip(dtypes, fields)]
                # Yield one tuple per row, as the compound dtype expects.
                yield tuple(values)

    data = np.fromiter(iter_func(), dtype=[('',np.float),('',np.float),('',np.float),('',np.int),('',np.int),('',np.int),('',np.int)])

    return data

data = iter_loadtxt(fname,skiprows=1)
  • Your very first step should be to stop using 32-bit Python and use 64-bit Python instead. This will unlock the rest of the memory on your machine. Commented Dec 1, 2016 at 12:09
  • Did you test this on a small file? The iter_func produces a stream of floats, without any grouping by line. I doubt that fromiter can handle a compound dtype this way. Commented Dec 1, 2016 at 15:37
  • @JohnZwinck Indeed. The 64-bit version of Python let me process the whole file. Thanks. Commented Dec 6, 2016 at 16:00

1 Answer


With a big enough input file, any code, however streamlined, can hit a memory error.

With all floats, each row of your 7-column array occupies 56 bytes; with the mixed dtype, 40. Not a dramatic change: if the code previously hit the memory error a third of the way through the file, it will now hit it (in theory) about halfway through.
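Those per-row figures can be checked directly with np.dtype.itemsize (a quick sketch; 'i4' is an assumption matching np.int on the asker's Windows build, where a C long is 32-bit):

```python
import numpy as np

# All seven columns as 64-bit floats: 7 * 8 = 56 bytes per row.
all_floats = np.dtype('f8,f8,f8,f8,f8,f8,f8')

# Mixed: three 8-byte floats plus four 4-byte ints: 24 + 16 = 40 bytes.
mixed = np.dtype('f8,f8,f8,i4,i4,i4,i4')

print(all_floats.itemsize, mixed.itemsize)  # 56 40
```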

iter_func reads the file and feeds a steady stream of floats (its own dtype). It does not return floats grouped by line. It records the row length, which is used at the end to reshape the 1d array.

fromiter can handle a compound dtype, but only if you feed it appropriately sized tuples.

In [342]: np.fromiter([(1,2),(3,4),(5,6)],dtype=np.dtype('i,i'))
Out[342]: 
array([(1, 2), (3, 4), (5, 6)], 
      dtype=[('f0', '<i4'), ('f1', '<i4')])

In [343]: np.fromiter([1,2,3,4],dtype=np.dtype('i,i'))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-343-d0fc5f822886> in <module>()
----> 1 np.fromiter([1,2,3,4],dtype=np.dtype('i,i'))

TypeError: a bytes-like object is required, not 'int'

Changing iter_func to something like this might work (not tested):

def iter_func():
    dtypes=[float,float,float,int,int,int,int]
    with open(filename, 'r') as infile:
        for _ in range(skiprows):
            next(infile)
        for line in infile:
            line = line.rstrip().split(delimiter)
            values = [t(v) for t,v in zip(dtypes, line)]
            yield tuple(values)
arr = np.fromiter(iter_func(), dtype=[('',np.float),('',np.float),('',np.float),('',np.int),('',np.int),('',np.int),('',np.int)] )
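As a quick self-test of the pattern above (not from the original thread; the sample file, values, and the 'f8,f8,f8,i8,i8,i8,i8' dtype string are made up for the demo):

```python
import os
import tempfile
import numpy as np

def iter_loadtxt(filename, delimiter=None, skiprows=0):
    """Stream a delimited text file into a structured array."""
    dtypes = [float, float, float, int, int, int, int]

    def iter_func():
        with open(filename, 'r') as infile:
            for _ in range(skiprows):
                next(infile)
            for line in infile:
                fields = line.rstrip().split(delimiter)
                # One tuple per row, as fromiter's compound dtype expects.
                yield tuple(t(v) for t, v in zip(dtypes, fields))

    return np.fromiter(iter_func(), dtype='f8,f8,f8,i8,i8,i8,i8')

# Two-row sample file with a header line to skip.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as f:
    f.write("x y z a b c d\n")
    f.write("1.0 2.0 3.0 4 5 6 7\n")
    f.write("8.5 9.5 10.5 11 12 13 14\n")
    path = f.name

data = iter_loadtxt(path, skiprows=1)
os.remove(path)

print(data.shape)    # (2,)
print(data['f3'])    # field f3 holds the fourth column: 4 and 11
```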

Comments

Thanks, I will try this out. However, I installed a 64-bit version of Python and made it through to the end of the process (although it took a huge amount of time!). I think this would be a good way to lower the computing time consistently, in line with your previous answer (which I linked in my own question). Once I verify that this is the case, and that your new code works, I will be happy to accept your answer.
What am I supposed to use as the dtype argument in arr = np.fromiter(iter_func(), dtype=...)? I am not quite sure I understand it... Sorry for asking, but it's not that clear from the function's documentation page.
I defined the dtypes list in iter_func to match the dtype parameter you tried in your question - the one with a mix of floats and ints.
Sorry to bother you again. What is not clear to me is what I am supposed to provide as the dtype parameter in the fromiter function (instead of the ...). From your comments and answer, it seems I have to provide [float,float,float,int,int,int,int], which is, however, already defined within iter_func, but I am probably wrong. I will add the improved code to the body of my question so that you can easily follow what I am doing.
See my edit - I copied the dtype from your original post.
