Why is explicit dimension expansion so space-inefficient compared to implicit NumPy broadcasting, when both techniques supposedly do the same thing, i.e. repeat the matrix along a given dimension?
I have two arrays: X(500, 3072) and X_train(5000, 3072). I want to calculate the distance of all 500 points in X from the 5000 points in X_train. When I try to do this via explicit dimension expansion, it takes over 60 GB of memory:
dists = np.linalg.norm(np.expand_dims(X, axis=1) - X_train, axis=2)
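To see why, it helps to look at the shape of the intermediate that the subtraction creates. A minimal sketch with small stand-in arrays (the sizes here are placeholders, not the ones from the question):

```python
import numpy as np

# Small stand-ins for the arrays in the question (float64 by default).
X = np.random.rand(5, 4)        # plays the role of X (500, 3072)
X_train = np.random.rand(7, 4)  # plays the role of X_train (5000, 3072)

# The subtraction broadcasts (5, 1, 4) against (7, 4),
# materializing a full (5, 7, 4) array in memory.
diff = np.expand_dims(X, axis=1) - X_train
print(diff.shape)   # (5, 7, 4)
print(diff.nbytes)  # 5 * 7 * 4 * 8 = 1120 bytes

# The norm then reduces the last axis to give all pairwise distances.
dists = np.linalg.norm(diff, axis=2)
print(dists.shape)  # (5, 7)
```

At the question's real sizes, `diff` would have shape (500, 5000, 3072), which is where the tens of gigabytes go.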
Whereas if I compute it via the expanded quadratic form, it gets done within MBs of memory (note this gives squared distances; an extra np.sqrt is needed to match the first version):
dists = np.square(X).sum(axis=1, keepdims=True) + np.square(X_train).sum(axis=1, keepdims=True).T - 2 * np.dot(X, X_train.T)
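The two formulations agree because ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y. A quick equivalence check on small arrays (sizes here are illustrative, not the ones from the question):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((5, 4))
X_train = rng.random((7, 4))

# Memory-heavy version: materializes a (5, 7, 4) intermediate.
d1 = np.linalg.norm(np.expand_dims(X, axis=1) - X_train, axis=2)

# Memory-light version: ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y,
# so only 2-D arrays are ever created: (5, 1), (1, 7), and (5, 7).
sq = (np.square(X).sum(axis=1, keepdims=True)
      + np.square(X_train).sum(axis=1, keepdims=True).T
      - 2 * np.dot(X, X_train.T))
d2 = np.sqrt(np.maximum(sq, 0))  # clip tiny negatives from round-off

print(np.allclose(d1, d2))  # True
```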
Why is the explicit dimension expansion so space-inefficient when the input matrices themselves are so small?
In np.expand_dims(X, axis=1) - X_train, the subtraction itself is a broadcast: the (500, 1, 3072) and (5000, 3072) operands broadcast to a full (500, 5000, 3072) result, and that is the step where it runs out of memory. The second version never materializes that 3-D array; its largest intermediates are the 2-D terms and the (500, 5000) output itself.
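The arithmetic for the broadcast intermediate, assuming float64 (8 bytes per element), accounts for the figure in the question:

```python
# Shape of the broadcast subtraction result: (500, 5000, 3072).
n_elements = 500 * 5000 * 3072          # 7.68 billion elements
n_bytes = n_elements * 8                # 8 bytes per float64
print(n_bytes)          # 61_440_000_000
print(n_bytes / 1e9)    # 61.44 GB (~57.2 GiB)
```

So "small" 2-D inputs of ~12 MB and ~123 MB combine into a single ~61 GB 3-D intermediate, while the quadratic-form version only ever allocates 2-D arrays no larger than the final (500, 5000) distance matrix (~20 MB).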