I've found myself using NumPy arrays for memory management and speed of computation more and more lately, on large volumes of structured data (such as points and polygons). In doing so, there is always a situation where I need to perform some function f(x) on the entire array. From experience, and Googling, iterating over the array is not the way to do this, so insted a function should be vectorized and broadcast to the entire array.
Looking at the documentation for numpy.vectorize we get this example:
def myfunc(a, b):
"Return a-b if a>b, otherwise return a+b"
if a > b:
return a - b
else:
return a + b
>>> vfunc = np.vectorize(myfunc)
>>> vfunc([1, 2, 3, 4], 2)
array([3, 4, 1, 2])
And per the docs it really just creates a for loop so it doesnt access the lower C loops for truly vectorized operations (either in BLAS or SIMD). So that got me wondering, if the above is "vectorized", what is this?:
def myfunc_2(a, b):
cond = a > b
a[cond] -= b
a[~cond] += b
return a
>>> myfunc_2(np.array([1, 2, 3, 4], 2))
array([3, 4, 1, 2])
Or even this:
>>> a = np.array([1, 2, 3, 4]
>>> b = 2
>>> np.where(a > b, a - b, a + b)
array([3, 4, 1, 2])
So I ran some tests on these, what I believe to be comparable examples:
>>> arr = np.random.randint(200, size=(1000000,))
>>> setup = 'from __main__ import vfunc, arr'
>>> timeit('vfunc(arr, 50)', setup=setup, number=1)
0.60175449999997
>>> arr = np.random.randint(200, size=(1000000,))
>>> setup = 'from __main__ import myfunc_2, arr'
>>> timeit('myfunc_2(arr, 50)', setup=setup, number=1)
0.07464979999997468
>>> arr = np.random.randint(200, size=(1000000,))
>>> setup = 'from __main__ import myfunc_3, arr'
>>> timeit('myfunc_3(arr, 50)', setup=setup, number=1)
0.0222587000000658
And with larger run windows:
>>> arr = np.random.randint(200, size=(1000000,))
>>> setup = 'from __main__ import vfunc, arr'
>>> timeit('vfunc(arr, 50)', setup=setup, number=1000)
621.5853878000003
>>> arr = np.random.randint(200, size=(1000000,))
>>> setup = 'from __main__ import myfunc_2, arr'
>>> timeit('myfunc_2(arr, 50)', setup=setup, number=1000)
98.19819199999984
>>> arr = np.random.randint(200, size=(1000000,))
>>> setup = 'from __main__ import myfunc_3, arr'
>>> timeit('myfunc_3(arr, 50)', setup=setup, number=1000)
26.128515100000186
Clearly the other options are major improvements over using numpy.vectorize. This leads me to wonder several things about why anybody would use numpy.vectorize at all if you can write what appear to be "purely vectorized" functions or use battery provided functions like numpy.where.
Now for the questions:
- What are the requirements to say a function is "vectorized" if not converted via
numpy.vectorize? Just broadcastable in its entirety? - How does NumPy determine if a function is "vectorized"/broadcastable?
- Why isn't this form of vectorization documented anywhere? (i.e., why doesn't NumPy have a "How to write a vectorized function" page?