
I have a text file containing data that I read into memory with numpy.genfromtxt, enforcing a custom numpy.dtype. Although the text file is smaller than the available RAM, I often get a MemoryError (which I don't understand, but it is not the point of this question). When looking for ways to resolve it, I came across dask. In the API I found methods for data loading, but none of them reads from text files, let alone supports the converters I need in genfromtxt().

I see there is a dask.dataframe.read_csv() method, but in my case I don't use pandas, but rather a plain numpy.array with custom dtypes and column names, as mentioned above. The text file I have is not a CSV anyway (hence the above-mentioned use of converters in genfromtxt()).
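
For context, here is a minimal sketch of the kind of call I mean; the dtype, column names, and converter below are made up for illustration:

import numpy

# Hypothetical setup: two named float columns, where the second
# column uses a decimal comma and therefore needs a converter.
my_dtype = numpy.dtype([('t', numpy.float64), ('v', numpy.float64)])

def decimal_comma(field):
    # e.g. turns '3,14' into 3.14
    return float(field.replace(',', '.'))

data = numpy.genfromtxt('data.txt', dtype=my_dtype,
                        converters={1: decimal_comma},
                        encoding='utf-8')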

Any ideas on how I could handle this will be appreciated.

2 Answers


You should use the function dask.bytes.read_bytes with delimiter="\n" to read your file(s) and split them into blocks at line endings. You get back a list of lists of dask.delayed objects (one inner list per file), which you can pass to numpy. Unfortunately, numpy wants a file-like object, so you must wrap the bytes again:

import io
import numpy
import dask
import dask.array as da

# blocks is a list of lists: one inner list of dask.delayed objects
# per input file, so flatten it before parsing
_, blocks = dask.bytes.read_bytes(files, delimiter="\n")
flat_blocks = [b for file_blocks in blocks for b in file_blocks]

@dask.delayed
def parse(block):
    # genfromtxt accepts a file-like object, so wrap the raw bytes
    return numpy.genfromtxt(io.BytesIO(block), ...)

arrays = [da.from_delayed(parse(block), shape=..., dtype=...) for block in flat_blocks]
arr = da.stack(arrays)  # or da.concatenate(arrays)
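
Note that stack and concatenate combine the per-block arrays differently; a quick illustration with small in-memory arrays:

import numpy
import dask.array as da

a = da.from_array(numpy.arange(3), chunks=3)
b = da.from_array(numpy.arange(3), chunks=3)

print(da.stack([a, b]).shape)        # (2, 3): adds a new leading axis
print(da.concatenate([a, b]).shape)  # (6,): joins along the existing axis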

8 Comments

Thank you; nevertheless, when I call compute() on arr I get the following error: "TypeError: a bytes-like object is required, not 'list'" as a result of the attempt to pass io.BytesIO(block) to numpy.genfromtxt(). I think numpy.genfromtxt() only expects a filename and can't work with byte blocks. Following your concept, I guess I should split the original file into many files on disk and then call numpy.genfromtxt() on each of them, later concatenating the resulting arrays. Is that correct?
Ah, I forgot that blocks is a list of lists. Please flatten, and this will work. genfromtxt works with file-like objects.
I had noticed the list issue, but flattening did not help; I incorrectly thought it was due to genfromtxt(). print(io.BytesIO(blocks[0])) yields TypeError: a bytes-like object is required, not 'list', and print(io.BytesIO(blocks[0][0])) yields TypeError: a bytes-like object is required, not 'Delayed', so the issue seems to be that the Delayed object is not being treated as bytes, which I don't know how to fix either.
Within the parse function, blocks[0][0] is a bytes object
Indeed it is (it is not clear to me why). To make it work I also had to add the shape argument to the call to from_delayed(). I don't get that either: why can't Dask figure out the shape on its own? I should not have to know the size of the arrays in the text files a priori. Also, stack worked, but not concat.
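
The reason block[0] is plain bytes inside parse is that dask resolves Delayed objects passed as arguments to a delayed function, including ones nested in lists, before the function body runs. A minimal sketch of this behaviour:

import dask

@dask.delayed
def show(x):
    # nested Delayed arguments have been computed to concrete
    # values by the time the function body runs
    return type(x[0]).__name__

d = dask.delayed(lambda: b'raw bytes')()
print(show([d]).compute())  # -> 'bytes'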

SO editors rejected my edit to @mdurant's answer, so I am posting the working code (based on that answer) here:

import numpy 
import dask
import dask.array as da
import io

fname = 'data.txt'
# data.txt is:
# 1 2
# 3 4
# 5 6

files = [fname]
_, blocks = dask.bytes.read_bytes(files, delimiter="\n")

my_type = numpy.dtype([
    ('f1', numpy.float64),
    ('f2', numpy.float64),
])

used_type = numpy.float64
# If the below line is uncommented, creating the dask array will still
# work, but it won't be possible to perform any operations on it
# used_type = my_type

# Debug
# print('blocks', blocks)
# print('type(blocks)', type(blocks))
# print('blocks[0]', blocks[0])
# print('type(blocks[0])', type(blocks[0]))

@dask.delayed
def parse(block):
    # block is the inner list of delayed chunks for one file; block[0]
    # is plain bytes by the time this delayed function actually runs
    r = numpy.genfromtxt(io.BytesIO(block[0]))
    print('parse() about to return:\n', r, '\n')
    return r

# Below I added shape, which seems compulsory, the reason for which I don't
# understand; with no dtype passed to genfromtxt, each block parses to a
# plain 3x2 float array
arrays = [da.from_delayed(value=parse(block), shape=(3, 2), dtype=used_type) for block in blocks]
# da.concatenate did not work for me (note there is no da.concat)
arr = da.stack(arrays)
# The below will not work if used_type is set to my_type
arr += 1
# Neither would the below work; it raises NotImplementedError
# arr['f1'] += 1
arr_np = arr.compute()
print('numpy array incremented by one: \n', arr_np)
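
For the structured-dtype case, a possible workaround (a sketch, assuming parse is changed to pass dtype=my_type to genfromtxt so each block really is a structured array of shape (3,)) is to update the field inside map_blocks, since dask arrays do not support in-place item assignment like arr['f1'] += 1:

# Hypothetical workaround: modify the field per block instead of
# assigning on the dask array itself.
def bump_f1(block):
    block = block.copy()
    block['f1'] += 1
    return block

arr2 = arr.map_blocks(bump_f1, dtype=arr.dtype)
print(arr2.compute())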

2 Comments

Sorry, I was not notified of any edit! Glad you got it working.
It was not you; it was some "senior" SO users who review edits. They rejected mine. Anyway, I posted this answer in case someone needs it in the future. Thank you for your help.
