Iterating through dask array chunks

Question

I am trying to manually iterate through the chunks of a dask array, one by one, and apply my computation. I understand that a benefit of dask is that it can to do the iteration for me, but my computation is failing (for reasons that I don't think are related to dask) and I want to iterate through manually for the purpose of debugging. How would I do that?

I am imagining something like:

import dask.array as da
data = da.random.randint(0, 30, size=(1_000, 100, 100), chunks=(-1, 10, 10))

for chunk in data.iterchunks():
    # chunk would contain some information about which chunk I have access to, 
    # and I could somehow get the data contained in that chunk
    chunk_data = get_chunk(chunk)
    my_function(chunk_data)

Where the chunk that I get back has some information about which chunk I am in, and there would also be get the data for that chunk.

Michael Delgado · Accepted Answer · 2022-04-26 17:48:09Z

Access the data within each chunk using the arr.blocks property. The BlockView object has an array-like interface, but accessing an element in the BlockView array returns the selected chunk(s) in the original array:

In [11]: data
Out[11]: dask.array<randint, shape=(1000, 100, 100), dtype=int64, chunksize=(1000, 10, 10), chunktype=numpy.ndarray>

In [12]: data.blocks
Out[12]: <dask.array.core.BlockView at 0x1730b2da0>

In [13]: data.blocks.shape
Out[13]: (1, 10, 10)

In [14]: data.blocks[0, 0, 0]
Out[14]: dask.array<blocks, shape=(1000, 10, 10), dtype=int64, chunksize=(1000, 10, 10), chunktype=numpy.ndarray>

In [15]: data.blocks[0, 0, 0].compute()
Out[15]:
array([[[14,  5, 24, ..., 25, 20,  6],
        [17, 12,  2, ..., 27, 13, 18],
        [13, 25,  2, ...,  7,  5, 22],
        ...,
        [12, 22, 26, ..., 15,  4, 11],
        [ 0, 26, 28, ..., 22, 14,  4],
        [ 9, 21, 14, ..., 15, 18, 21]],

       ...,

       [[ 3,  2, 20, ..., 27,  0, 12],
        [21, 17,  7, ..., 23,  3, 23],
        [24, 13,  0, ..., 26,  1,  0],
        ...,
        [ 5, 25,  6, ..., 22,  6, 16],
        [16, 25, 21, ..., 22, 14, 15],
        [ 8, 20, 17, ..., 29, 13,  1]]])

So in your case, you could loop through all blocks with the following:

In [34]: for inds in itertools.product(*map(range, data.blocks.shape)):
    ...:     chunk = data.blocks[inds]
    ...:     my_function(chunk)

This will be slow, but it does I think what you're looking for.

Flow · Accepted Answer · 2022-04-26 14:03:01Z

0

Try using data.chunks instead of data.iterchunks().

answered Apr 26, 2022 at 14:03

Flow

1,0701 gold badge13 silver badges28 bronze badges

1 Comment

Rachel W Over a year ago

That gives me the size of the chunks (here (1000,) (10, 10, 10, 10, 10, 10, 10, 10, 10, 10) (10, 10, 10, 10, 10, 10, 10, 10, 10, 10)) but not the actual data values. Do you know how I would convert from the chunk sizes to actual blocks of data?

darthbith · Accepted Answer · 2022-04-28 02:52:14Z

0

You can use da.map_blocks and avoid the for-loop:

import dask.array as da
data = da.random.randint(0, 30, size=(1_000, 100, 100), chunks=(-1, 10, 10))
mapped_data = da.map_blocks(my_function, data)
# This is equivalent
mapped_data = data.map_blocks(my_function)

answered Apr 28, 2022 at 2:52

darthbith

19.8k11 gold badges64 silver badges81 bronze badges

Collectives™ on Stack Overflow

Iterating through dask array chunks

3 Answers 3

Comments

1 Comment

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

Comments

1 Comment

Comments

Your Answer

Sign up or log in

Post as a guest

Related