
I have a large amount of data in NetCDF4 files, and I am trying to write a script that dynamically chunks this data: hold as much of it in memory as possible, do calculations on that chunk and save the results, then move on to the next chunk.

Here is an example of what I am trying to do. Say I have an array like this:

import numpy as np
arr = np.random.randint(0, 10, (100, 15, 51))  # Call these x, y, and z coordinates

And I only want to read ten of the x coordinates at a time, like this:

placeholder = 0
for i in range(10, 101, 10):
    tmp_array = arr[placeholder:i, :, :]
    # Do calculations here and save results to file or database
    placeholder += 10

Is there some sort of built-in method for this? In this simple example it works pretty well, but as things get more complicated this seems like it could get to be a headache for me to manage all of this myself. I am aware of Dask, but it is unhelpful to me in this situation because I am not doing array operations with the data. Although Dask could be useful to me if it had methods to deal with this too.

3 Answers


You can reduce the complexity and increase the robustness by implementing a lazy generator that encapsulates the computation you're worried about and just returns the chunk at each step. Something like this perhaps:

def spliterate(buf, chunk):
    for start in range(0, len(buf), chunk):
        yield buf[start:start + chunk]

Using it is pretty straightforward:

for tmp in spliterate(arr, 10):
    # do calculations on tmp, don't worry about bookkeeping
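As a quick sanity check, here is a sketch with an in-memory array; a netCDF4 variable should behave the same way, since slicing it reads only the requested block from disk:

```python
import numpy as np

def spliterate(buf, chunk):
    for start in range(0, len(buf), chunk):
        yield buf[start:start + chunk]

arr = np.random.randint(0, 10, (100, 15, 51))

# Each yielded chunk covers 10 x coordinates
shapes = [tmp.shape for tmp in spliterate(arr, 10)]
print(shapes[0], len(shapes))  # (10, 15, 51) 10
```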

Comments

I like the look of this solution, and it seems to work well with what I am trying to accomplish. So this generator doesn't actually read the slice into memory until it is consumed in the for loop, correct? I haven't used a generator before, so I'm just trying to make sure I understand.
@Wade. Assuming that that's how your HDF package works, that will be the case. A normal numpy array exists only in memory to begin with. The generator will not access a chunk until the loop gets to it. Previous chunks will be garbage collected. For an in-memory numpy array, the chunk would be a view that wouldn't allocate a copy of the data.

The Dask documentation shows how to create chunked arrays for just the kind of computation you have in mind, for the case of HDF5 files: http://docs.dask.org/en/latest/array-creation.html#numpy-slicing . Your netCDF4 case may or may not work identically; if not, the section further down about dask.delayed will do the trick.

Having made your dask array, you will want to use the map_blocks method for the "do something with each chunk" operation (this expects some output back), loop over the contents of the .blocks attribute, or use .to_delayed() to do arbitrary things with each piece. Which is right for you depends on what you want to achieve.
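A minimal sketch of both routes, using an in-memory array as a stand-in for the file-backed one (the dask.array calls are real; the mean reduction is just an illustrative placeholder for your calculation):

```python
import numpy as np
import dask.array as da

arr = np.random.randint(0, 10, (100, 15, 51))
darr = da.from_array(arr, chunks=(10, 15, 51))  # chunk 10 x coordinates at a time

# Route 1: map_blocks applies a function to every chunk and collects the output
chunk_means = darr.map_blocks(
    lambda block: block.mean(keepdims=True), chunks=(1, 1, 1)
).compute()  # shape (10, 1, 1): one mean per chunk

# Route 2: to_delayed() gives one lazy object per chunk for arbitrary work
for delayed_chunk in darr.to_delayed().ravel():
    chunk = delayed_chunk.compute()  # a (10, 15, 51) numpy block
    # do calculations here and save results to file or database
```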



You can use np.split, which takes an array and either the number of equal sections to produce or a list of indices at which to split. Your case would be np.split(arr, 10), giving you a list of 10 arrays of shape (10, 15, 51).

Note that an exception is raised if the axis cannot be divided equally, e.g., if you asked for 9 sections (100 is not divisible by 9). If you want to split into nearly-equal chunks without raising, you can use np.array_split instead.
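The difference between the two, sketched on the question's array:

```python
import numpy as np

arr = np.random.randint(0, 10, (100, 15, 51))

# np.split requires the axis to divide evenly into the requested sections
chunks = np.split(arr, 10)        # 10 arrays of shape (10, 15, 51)

# np.array_split tolerates uneven division instead of raising
uneven = np.array_split(arr, 9)   # 9 arrays; earlier ones get the extra rows
```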

2 Comments

But wouldn't this require the entire array to be held in memory at once? I am reading the big array from a file; sorry if the example is unclear on that.
Ah, that wasn't clear. Yes, this method would hold the whole thing in memory as a list of arrays. If you want a lazy version of np.split, you could use something from the itertools module in the standard library; something like the grouper recipe from the itertools documentation would work.
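A lazy stand-in for np.split along those lines might look like this (a sketch; lazy_chunks is a hypothetical helper built on itertools.islice rather than the grouper recipe itself, which pads its last group with a fill value):

```python
from itertools import islice

def lazy_chunks(iterable, size):
    """Yield successive lists of up to `size` items without
    materializing the whole sequence at once."""
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk

# Iterating a file-backed dataset row by row keeps only one chunk in memory
for chunk in lazy_chunks(range(25), 10):
    print(len(chunk))  # 10, 10, 5
```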
