I want to create a dataset B by processing a dataset A. Therefore, every column in A (~2 million) has to be processed in a batch fashion (put through a neural network), resulting in 3 outputs, which are stacked together and then stored, e.g., in a numpy array.

My code looks like the following, which does not seem to be the best solution.

# Load data
data = get_data()

# Storage for B
B = np.empty(shape=data.shape)

# Process data
for idx, data_B in enumerate(data):
    # Process data
    a, b, c = model(data_B)

    # Reshape and feed in B
    B[idx * batch_size:batch_size * (idx + 1)] = np.squeeze(np.concatenate((a, b, c), axis=1))

I am looking for ideas to speed up the stacking or assigning process. I do not know whether parallel processing is possible, since in the end everything has to be stored in the same array (the ordering is not important). Is there any Python framework I can use?

Loading the data takes 29 s (only done once), and stacking and assigning takes 20 s for a batch size of only 2. The model call takes < 1 s, allocating the array takes 5 s, and all other parts take < 1 s.

  • My guess is that processing the data (the model call) takes longer than the iteration mechanism, including stacking. You could, of course, test that with a comparison run that does minimal processing. Commented Jan 25, 2018 at 8:49
  • I know the pandas framework is focused on dataset processing and allows you to transform datasets efficiently (it uses numpy arrays under the hood). You could give it a try. Commented Jan 25, 2018 at 9:46
  • No, that is not the case. Please see my additions. Commented Jan 25, 2018 at 12:37

1 Answer

Your array shapes, and especially the number of dimensions, are unclear. I can make a few guesses from what works in the code. Your times suggest that things are very large, so memory management may be a big issue. Creating large temporary arrays takes time.

What is data.shape? Probably at least 2d; B has the same shape:

B = np.empty(shape=data.shape)

Now you iterate on the 1st dimension of data; let's call them rows, though they might be 2d or larger:

# Process data
for idx, data_B in enumerate(data):
    # Process data
    a, b, c = model(data_B)

What is the nature of a, etc.? I'm assuming arrays, with a shape similar to data_B. But that is just a guess.

    # Reshape and feed in B
    B[idx * batch_size:batch_size * (idx + 1)] = \
        np.squeeze(np.concatenate((a, b, c), axis=1))

For concatenate on axis=1 to work, a, b, c must be 2d (at least). Let's guess they are all (n,m). The result is (n,3m). Why the squeeze? Is the shape (1,3m)?
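To make the shape bookkeeping concrete, here is a tiny sketch with made-up shapes (the (1, m) guess is mine, not from the question):

import numpy as np

m = 4
a = np.ones((1, m))
b = np.ones((1, m)) * 2
c = np.ones((1, m)) * 3

stacked = np.concatenate((a, b, c), axis=1)   # shape (1, 3*m) = (1, 12)
flat = np.squeeze(stacked)                    # shape (12,) - squeeze drops the leading 1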

I don't know batch_size, but with anything other than 1 I don't think this works. B[idx:idx+1, :] = ... works since idx ranges over B.shape[0], but with other values it would produce an error.

With this batch_size slice indexing it almost looks like you are trying to string out the iteration values in a long 1d array, batch_size values per iteration. But that doesn't fit with B matching data in shape.
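To illustrate the mismatch I mean, with toy shapes of my own choosing: if the right-hand side is a single squeezed row, a slice of length batch_size either broadcasts that one row into every slot of the slice, or raises if the lengths don't line up.

import numpy as np

B = np.zeros((4, 6))
result = np.arange(6.)          # one iteration's squeezed output, 3*m with m = 2
batch_size = 2

B[0 * batch_size:batch_size * 1] = result   # slice covers 2 rows; the single row is broadcast
print(B[:2])
# [[0. 1. 2. 3. 4. 5.]
#  [0. 1. 2. 3. 4. 5.]]

# B[0:2] = np.arange(5.)        # ValueError: could not broadcast input array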

That puzzle aside, I wonder if you really need the concatenate. Can you initialize B so you can assign values directly, e.g.

B[idx, 0, ...] = a
B[idx, 1, ...] = b
etc.

Reshaping an array after filling is trivial. Even transposing axes isn't too time-consuming.
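A rough sketch of what I mean, with dummy stand-ins for your data and model (the names and shapes are guesses, not from the question):

import numpy as np

n_batches, out_len = 1000, 8
data = np.random.rand(n_batches, out_len)

def model(row):
    # stand-in for the real network: three equally shaped outputs
    return row, 2 * row, 3 * row

# allocate B with an explicit axis for the three outputs, so each one
# can be assigned in place instead of concatenating into a temporary
B = np.empty((n_batches, 3, out_len))

for idx, data_B in enumerate(data):
    a, b, c = model(data_B)
    B[idx, 0] = a
    B[idx, 1] = b
    B[idx, 2] = c

# reshape afterwards if the flat layout is needed; this is cheap
B_flat = B.reshape(n_batches, -1)   # shape (n_batches, 3 * out_len)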

1 Comment

Thanks, your last suggestion gave me a speedup of ~30%!
