Why is there such a large speed difference between the following L2 norm calculations:

a = np.arange(1200.0).reshape((-1,3))

%timeit [np.sqrt((a*a).sum(axis=1))]
100000 loops, best of 3: 12 µs per loop

%timeit [np.sqrt(np.dot(x,x)) for x in a]
1000 loops, best of 3: 814 µs per loop

%timeit [np.linalg.norm(x) for x in a]
100 loops, best of 3: 2 ms per loop

All three produce identical results as far as I can see.
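The equivalence can be checked directly rather than by eye (a minimal sketch using np.allclose):

```python
import numpy as np

a = np.arange(1200.0).reshape((-1, 3))

# The three per-row L2 norm variants being timed above.
v1 = np.sqrt((a * a).sum(axis=1))
v2 = np.array([np.sqrt(np.dot(x, x)) for x in a])
v3 = np.array([np.linalg.norm(x) for x in a])

# They agree to floating-point precision.
print(np.allclose(v1, v2) and np.allclose(v2, v3))  # → True
```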

Here's the relevant part of the source code for the numpy.linalg.norm function (the default fast path):

x = asarray(x)

# Check the default case first and handle it immediately.
if ord is None and axis is None:
    x = x.ravel(order='K')
    if isComplexType(x.dtype.type):
        sqnorm = dot(x.real, x.real) + dot(x.imag, x.imag)
    else:
        sqnorm = dot(x, x)
    return sqrt(sqnorm)
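Note that this default path ravels the input and computes a single global norm (the Frobenius norm for a matrix), which is why the examples above call it once per row in a loop; to get per-row norms in one call you need the axis= argument. A quick sketch:

```python
import numpy as np

a = np.arange(1200.0).reshape((-1, 3))

# Default path: ravel, then one dot over all 1200 elements.
flat = a.ravel(order='K')
print(np.isclose(np.linalg.norm(a), np.sqrt(np.dot(flat, flat))))  # → True

# Per-row norms require axis=1 (one norm per 3-element row).
print(np.linalg.norm(a, axis=1).shape)  # → (400,)
```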

EDIT: Someone suggested that one version could be parallelized, but I checked and that's not the case. All three versions consume 12.5% of CPU (as is usually the case with Python code on my 4 physical / 8 virtual core Xeon CPU).

5 Comments

  • A couple more to time: [math.sqrt(np.dot(x,x)) for x in a], np.sqrt(np.einsum('ij,ij->i',a,a)) Commented Aug 18, 2015 at 23:35
  • The main difference is in what is being done in interpreted Python code, and what is done in compiled C code. Commented Aug 18, 2015 at 23:38
  • One thing I noticed is that the first method gives results with much smaller precision than the others. For example the final number produced by the first method is 2074.9973494 while that of the last two is 2074.9973493958973. Commented Aug 18, 2015 at 23:38
  • Tris Nefzger, I checked the dtype of the results; in all three cases it's float64. Commented Aug 18, 2015 at 23:52
  • Lists and arrays have different rules for displaying the significant figures of floats. So the display doesn't tell you much of the float type. Commented Aug 19, 2015 at 1:43

1 Answer

np.dot will usually call a BLAS library function - its speed will thus depend on which BLAS library your version of numpy is linked against. In general I would expect it to have a greater constant overhead, but to scale much better as the array size increases. However, the fact that you are calling it from within a list comprehension (effectively a normal Python for loop) will likely negate any performance benefits of using BLAS.
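That per-call overhead can be seen by timing many small dots driven from a Python loop against a single call over the flattened array (a rough sketch; absolute numbers depend on your BLAS and machine):

```python
import numpy as np
from timeit import timeit

a = np.arange(1200.0).reshape((-1, 3))

# 400 tiny dot products, each dispatched from interpreted Python...
t_loop = timeit(lambda: [np.dot(x, x) for x in a], number=100)

# ...versus a single call over all 1200 elements.
t_once = timeit(lambda: np.dot(a.ravel(), a.ravel()), number=100)

print(t_loop > t_once)  # → True; per-call dispatch overhead dominates
```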

If you get rid of the list comprehension and use the axis= kwarg, np.linalg.norm is comparable to your first example, but np.einsum is much faster than both:

In [1]: %timeit np.sqrt((a*a).sum(axis=1))
The slowest run took 10.12 times longer than the fastest. This could mean that an intermediate result is being cached 
100000 loops, best of 3: 11.1 µs per loop

In [2]: %timeit np.linalg.norm(a, axis=1)
The slowest run took 14.63 times longer than the fastest. This could mean that an intermediate result is being cached 
100000 loops, best of 3: 13.5 µs per loop

# this is what np.linalg.norm does internally
In [3]: %timeit np.sqrt(np.add.reduce(a * a, axis=1))
The slowest run took 34.05 times longer than the fastest. This could mean that an intermediate result is being cached 
100000 loops, best of 3: 10.7 µs per loop

In [4]: %timeit np.sqrt(np.einsum('ij,ij->i',a,a))
The slowest run took 5.55 times longer than the fastest. This could mean that an intermediate result is being cached 
100000 loops, best of 3: 5.42 µs per loop
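Part of einsum's edge here is that it multiplies and reduces in a single pass, without materialising the intermediate a * a array that the .sum(axis=1) version allocates (a sketch verifying the two are equivalent):

```python
import numpy as np

a = np.arange(1200.0).reshape((-1, 3))

# einsum('ij,ij->i') computes sum_j a[i,j]*a[i,j] in one pass;
# (a * a).sum(axis=1) first allocates a full temporary array.
sq = np.einsum('ij,ij->i', a, a)
print(np.allclose(np.sqrt(sq), np.sqrt((a * a).sum(axis=1))))  # → True
```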

5 Comments

I'm curious, what changes under the hood when we replace the Python list comprehension with the axis=1 argument? Or, why is a Python for loop slower than a C for loop?
To answer your first question, when you use vectorization (i.e. axis=1 rather than the list comp) numpy essentially does the loop over array elements at the level of C code rather than in Python. As for why looping in C is faster than Python... that's quite a big question. The simple answer is that slower runtime performance is the price you pay in exchange for Python's nice high-level language features like type checking, automatic garbage collection etc. (see this Q/A for a bit more detail).
By vectorization do you mean Intel AVX? Regarding Python vs C, it's not very clear to me: is Python bytecode still doing all that checking/garbage collection/etc, or is it only done once during the compilation to the bytecode? If the bytecode does not have that overhead, what makes it slower than the native machine code?
In numpy-land we tend to use "vectorization" loosely to refer to array programming. The bytecode produced by the Python compiler is not the same sort of thing as the native instructions that C etc. compile down to; rather, it's a set of instructions for a virtual machine which does the type checking, garbage collection etc. at runtime. There are other Python implementations besides the standard CPython that attempt various runtime optimizations (for example, you may have heard of PyPy, which uses an optimizing JIT compiler).
This all goes way beyond the scope of what I can reasonably answer within the comments on an SE question. You can find some more details about the CPython internals here and here.
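To make the "instructions for a virtual machine" point concrete, the standard-library dis module shows the bytecode CPython compiles a function to; exact opcode names vary between Python versions:

```python
import dis

def square(x):
    # The VM interprets these opcodes at runtime, performing
    # dynamic type dispatch and reference counting on each step,
    # unlike the native machine code a C compiler produces.
    return x * x

dis.dis(square)  # prints e.g. LOAD_FAST x, a multiply opcode, RETURN_VALUE
```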
