I want to create a dataset B by processing a dataset A. Therefore, every column in A (~2 million) has to be processed in a batch fashion (put through a neural network), resulting in 3 outputs, which are stacked together and then stored, e.g., in a numpy array.

My code looks like the following, which does not seem to be the best solution.

# Load data
data = get_data()

# Storage for B
B = np.empty(shape=data.shape)

# Process data
for idx, data_B in enumerate(data):
    # Process data
    a, b, c = model(data_B)

    # Reshape and feed in B
    B[idx * batch_size:batch_size * (idx + 1)] = np.squeeze(np.concatenate((a, b, c), axis=1))

I am looking for ideas to speed up the stacking or assigning process. I do not know whether parallel processing is possible, since in the end everything has to be stored in the same array (the ordering is not important). Is there any Python framework I can use?

Loading the data takes 29 s (only done once), and stacking and assigning takes 20 s for a batch size of only 2. The model call takes < 1 s, allocating the array takes 5 s, and all other parts take < 1 s.

  • My guess is that processing the data (the model call) takes longer than the iteration mechanism, including stacking. You could, of course, test that with a comparison run that does minimal processing. Commented Jan 25, 2018 at 8:49
  • I know the pandas framework is focused on dataset processing and allows you to transform datasets efficiently (it uses numpy arrays under the hood). You could give it a try. Commented Jan 25, 2018 at 9:46
  • No, that is not the case. Please see my additions. Commented Jan 25, 2018 at 12:37

1 Answer

Your array shapes, and especially the number of dimensions, are unclear. I can make a few guesses from what works in the code. Your times suggest that things are very large, so memory management may be a big issue. Creating large temporary arrays takes time.

What is data.shape? Probably at least 2d; B has the same shape:

B = np.empty(shape=data.shape)

Now you iterate on the 1st dimension of data; let's call them rows, though they might be 2d or larger:

# Process data
for idx, data_B in enumerate(data):
    # Process data
    a, b, c = model(data_B)

What is the nature of a, etc.? I'm assuming arrays, with a shape similar to data_B. But that is just a guess.

    # Reshape and feed in B
    B[idx * batch_size:batch_size * (idx + 1)] = \
        np.squeeze(np.concatenate((a, b, c), axis=1))

For concatenate on axis=1 to work, a, b, c must be 2d (at least). Let's guess they are all (n,m). The result is (n,3m). Why the squeeze? Is the shape (1,3m)?
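To make the shape bookkeeping concrete, here is a tiny sketch with made-up shapes (the (1, m) guess is mine, not from the question):

import numpy as np

m = 4
a = np.ones((1, m))
b = np.ones((1, m)) * 2
c = np.ones((1, m)) * 3

stacked = np.concatenate((a, b, c), axis=1)   # shape (1, 3*m) = (1, 12)
flat = np.squeeze(stacked)                    # shape (12,) - squeeze drops the leading 1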

I don't know batch_size, but with anything other than 1 I don't think this works. B[idx:idx+1, :] = ... works since idx ranges over B.shape[0], but with other values it would produce an error.

With this batch_size slice indexing it almost looks like you are trying to string out the iteration values in a long 1d array, batch_size values per iteration. But that doesn't fit with B matching data in shape.
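To illustrate the mismatch I mean, with toy shapes of my own choosing: if the right-hand side is a single squeezed row, a slice of length batch_size either broadcasts that one row into every slot of the slice, or raises if the lengths don't line up.

import numpy as np

B = np.zeros((4, 6))
result = np.arange(6.)          # one iteration's squeezed output, 3*m with m = 2
batch_size = 2

B[0 * batch_size:batch_size * 1] = result   # slice covers 2 rows; the single row is broadcast
print(B[:2])
# [[0. 1. 2. 3. 4. 5.]
#  [0. 1. 2. 3. 4. 5.]]

# B[0:2] = np.arange(5.)        # ValueError: could not broadcast input array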

That puzzle aside, I wonder if you really need the concatenate. Can you initialize B so you can assign values directly, e.g.

B[idx, 0, ...] = a
B[idx, 1, ...] = b
etc.

Reshaping an array after filling is trivial. Even transposing axes isn't too time-consuming.
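A rough sketch of what I mean, with dummy stand-ins for your data and model (the names and shapes are guesses, not from the question):

import numpy as np

n_batches, out_len = 1000, 8
data = np.random.rand(n_batches, out_len)

def model(row):
    # stand-in for the real network: three equally shaped outputs
    return row, 2 * row, 3 * row

# allocate B with an explicit axis for the three outputs, so each one
# can be assigned in place instead of concatenating into a temporary
B = np.empty((n_batches, 3, out_len))

for idx, data_B in enumerate(data):
    a, b, c = model(data_B)
    B[idx, 0] = a
    B[idx, 1] = b
    B[idx, 2] = c

# reshape afterwards if the flat layout is needed; this is cheap
B_flat = B.reshape(n_batches, -1)   # shape (n_batches, 3 * out_len)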

1 Comment

Thanks, your last suggestion gave me a speedup of ~30%!
