
I have a large amount of data in NetCDF4 files, and I am trying to write a script that dynamically chunks this data: hold as much of it in memory as possible, do calculations on that chunk and save the results, then move on to the next chunk.

Here is an example of what I am trying to do. Say I have an array like this:

import numpy as np
arr = np.random.randint(0, 10, (100, 15, 51))  # Call these x, y, and z coordinates

And I only want to read ten of the x coordinates at a time, like this:

placeholder = 0
for i in range(10, 101, 10):
    tmp_array = arr[placeholder:i, :, :]
    # Do calculations here and save results to file or database
    placeholder += 10

Is there some sort of built-in method for this? In this simple example it works pretty well, but as things get more complicated this seems like it could get to be a headache for me to manage all of this myself. I am aware of Dask, but it is unhelpful to me in this situation because I am not doing array operations with the data. Although Dask could be useful to me if it had methods to deal with this too.

3 Answers


You can reduce the complexity and increase the robustness by implementing a lazy generator that encapsulates the computation you're worried about and just returns the chunk at each step. Something like this perhaps:

def spliterate(buf, chunk):
    for start in range(0, len(buf), chunk):
        yield buf[start:start + chunk]

Using it is pretty straightforward:

for tmp in spliterate(arr, 10):
    # do calculations on tmp, don't worry about bookkeeping
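As a quick sanity check, here is a sketch with an in-memory array; a netCDF4 variable should behave the same way, since slicing it reads only the requested block from disk:

```python
import numpy as np

def spliterate(buf, chunk):
    for start in range(0, len(buf), chunk):
        yield buf[start:start + chunk]

arr = np.random.randint(0, 10, (100, 15, 51))

# Each yielded chunk covers 10 x coordinates
shapes = [tmp.shape for tmp in spliterate(arr, 10)]
print(shapes[0], len(shapes))  # (10, 15, 51) 10
```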

Comments

I like the look of this solution, and it seems to work well with what I am trying to accomplish. So this generator doesn't actually read the slice into memory until it is consumed in the for loop, correct? I haven't used a generator before, so I'm just trying to make sure I understand.
@Wade. Assuming that that's how your HDF package works, that will be the case. A normal numpy array exists only in memory to begin with. The generator will not access a chunk until the loop gets to it. Previous chunks will be garbage collected. For an in-memory numpy array, the chunk would be a view that wouldn't allocate a copy of the data.

The Dask documentation shows how to create chunked arrays for just the kind of computation you have in mind, for the case of HDF5 files: http://docs.dask.org/en/latest/array-creation.html#numpy-slicing . Your netCDF4 case may or may not work identically; if not, the section further down about dask.delayed will do the trick.

Having made your dask array, you will want to use the map_blocks method for the "do something with each chunk" operation (this expects some output back), loop over the contents of the .blocks attribute, or use .to_delayed() to do arbitrary things with each piece. Which is right for you depends on what you want to achieve.
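A minimal sketch of both routes, using an in-memory array as a stand-in for the file-backed one (the dask.array calls are real; the mean reduction is just an illustrative placeholder for your calculation):

```python
import numpy as np
import dask.array as da

arr = np.random.randint(0, 10, (100, 15, 51))
darr = da.from_array(arr, chunks=(10, 15, 51))  # chunk 10 x coordinates at a time

# Route 1: map_blocks applies a function to every chunk and collects the output
chunk_means = darr.map_blocks(
    lambda block: block.mean(keepdims=True), chunks=(1, 1, 1)
).compute()  # shape (10, 1, 1): one mean per chunk

# Route 2: to_delayed() gives one lazy object per chunk for arbitrary work
for delayed_chunk in darr.to_delayed().ravel():
    chunk = delayed_chunk.compute()  # a (10, 15, 51) numpy block
    # do calculations here and save results to file or database
```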



You can use np.split, which takes an array and either the number of equal sections to produce or a list of indices at which to split. Your case would be np.split(arr, 10), giving you a list of 10 arrays of shape (10, 15, 51).

Note that an exception is raised if the axis cannot be divided equally, e.g., if you asked for 9 sections (100 is not divisible by 9). If you want to split into nearly-equal chunks without raising, you can use np.array_split instead.
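The difference between the two, sketched on the question's array:

```python
import numpy as np

arr = np.random.randint(0, 10, (100, 15, 51))

# np.split requires the axis to divide evenly into the requested sections
chunks = np.split(arr, 10)        # 10 arrays of shape (10, 15, 51)

# np.array_split tolerates uneven division instead of raising
uneven = np.array_split(arr, 9)   # 9 arrays; earlier ones get the extra rows
```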

2 Comments

But wouldn't this require the entire array to be held in memory at once? I am reading the big array from a file; sorry if the example is unclear on that.
Ah, that wasn't clear. Yes, this method would hold the whole thing in memory as a list of arrays. If you want a lazy version of np.split, you could use something from the itertools module in the standard library; something like the grouper recipe from the itertools documentation would work.
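A lazy stand-in for np.split along those lines might look like this (a sketch; lazy_chunks is a hypothetical helper built on itertools.islice rather than the grouper recipe itself, which pads its last group with a fill value):

```python
from itertools import islice

def lazy_chunks(iterable, size):
    """Yield successive lists of up to `size` items without
    materializing the whole sequence at once."""
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk

# Iterating a file-backed dataset row by row keeps only one chunk in memory
for chunk in lazy_chunks(range(25), 10):
    print(len(chunk))  # 10, 10, 5
```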
