1

I am trying to manually iterate through the chunks of a dask array, one by one, and apply my computation. I understand that a benefit of dask is that it can to do the iteration for me, but my computation is failing (for reasons that I don't think are related to dask) and I want to iterate through manually for the purpose of debugging. How would I do that?

I am imagining something like:

import dask.array as da
data = da.random.randint(0, 30, size=(1_000, 100, 100), chunks=(-1, 10, 10))

for chunk in data.iterchunks():
    # chunk would contain some information about which chunk I have access to, 
    # and I could somehow get the data contained in that chunk
    chunk_data = get_chunk(chunk)
    my_function(chunk_data)

Where the chunk that I get back has some information about which chunk I am in, and there would also be get the data for that chunk.

3 Answers 3

1

Access the data within each chunk using the arr.blocks property. The BlockView object has an array-like interface, but accessing an element in the BlockView array returns the selected chunk(s) in the original array:

In [11]: data
Out[11]: dask.array<randint, shape=(1000, 100, 100), dtype=int64, chunksize=(1000, 10, 10), chunktype=numpy.ndarray>

In [12]: data.blocks
Out[12]: <dask.array.core.BlockView at 0x1730b2da0>

In [13]: data.blocks.shape
Out[13]: (1, 10, 10)

In [14]: data.blocks[0, 0, 0]
Out[14]: dask.array<blocks, shape=(1000, 10, 10), dtype=int64, chunksize=(1000, 10, 10), chunktype=numpy.ndarray>

In [15]: data.blocks[0, 0, 0].compute()
Out[15]:
array([[[14,  5, 24, ..., 25, 20,  6],
        [17, 12,  2, ..., 27, 13, 18],
        [13, 25,  2, ...,  7,  5, 22],
        ...,
        [12, 22, 26, ..., 15,  4, 11],
        [ 0, 26, 28, ..., 22, 14,  4],
        [ 9, 21, 14, ..., 15, 18, 21]],

       ...,

       [[ 3,  2, 20, ..., 27,  0, 12],
        [21, 17,  7, ..., 23,  3, 23],
        [24, 13,  0, ..., 26,  1,  0],
        ...,
        [ 5, 25,  6, ..., 22,  6, 16],
        [16, 25, 21, ..., 22, 14, 15],
        [ 8, 20, 17, ..., 29, 13,  1]]])

So in your case, you could loop through all blocks with the following:

In [34]: for inds in itertools.product(*map(range, data.blocks.shape)):
    ...:     chunk = data.blocks[inds]
    ...:     my_function(chunk)

This will be slow, but it does I think what you're looking for.

Sign up to request clarification or add additional context in comments.

Comments

0

Try using data.chunks instead of data.iterchunks().

1 Comment

That gives me the size of the chunks (here (1000,) (10, 10, 10, 10, 10, 10, 10, 10, 10, 10) (10, 10, 10, 10, 10, 10, 10, 10, 10, 10)) but not the actual data values. Do you know how I would convert from the chunk sizes to actual blocks of data?
0

You can use da.map_blocks and avoid the for-loop:

import dask.array as da
data = da.random.randint(0, 30, size=(1_000, 100, 100), chunks=(-1, 10, 10))
mapped_data = da.map_blocks(my_function, data)
# This is equivalent
mapped_data = data.map_blocks(my_function)

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.