Comparing two numpy arrays of different length

Question

I need to find the indices of the first less than or equal occurrence of elements of one array in another array. One way that works is this:

import numpy
a = numpy.array([10,7,2,0])
b = numpy.array([10,9,8,7,6,5,4,3,2,1])
indices = [numpy.where(a<=x)[0][0] for x in b]

indices has the value [0, 1, 1, 1, 2, 2, 2, 2, 2, 3], which is what I need. The problem of course, is that python "for" loop is slow and my arrays might have millions of elements. Is there any numpy trick for this? This doesn't work because they arrays are not of the same length:

indices = numpy.where(a<=b) #XXX: raises an exception

Thanks!

Can you provide reasonable sizes for a and b? The timings of methods are heavily influenced by this. — Daniel
– Daniel, Commented Sep 18, 2013 at 15:52

Daniel · Accepted Answer · 2013-09-18 22:48:55Z

15

This may be a special case, but you should be able to use numpy digitize. The caveat here is the bins must be monotonically decreasing or increasing.

>>> import numpy
>>> a = numpy.array([10,7,2,0])
>>> b = numpy.array([10,9,8,7,6,5,4,3,2,1])

>>> indices = [numpy.where(a<=x)[0][0] for x in b]
[0, 1, 1, 1, 2, 2, 2, 2, 2, 3]

>>> numpy.digitize(b,a)
array([0, 1, 1, 1, 2, 2, 2, 2, 2, 3])

Setup for the timing test:

a = np.arange(50)[::-1]

b = np.random.randint(0,50,1E3)

np.allclose([np.where(a<=x)[0][0] for x in b],np.digitize(b,a))
Out[55]: True

Some timings:

%timeit [np.where(a<=x)[0][0] for x in b]
100 loops, best of 3: 4.97 ms per loop

%timeit np.digitize(b,a)
10000 loops, best of 3: 48.1 µs per loop

Looks like two orders of magnitude speed up, this will depend heavily on the number of bins however. Your timings will vary.

To compare to Jamie's answer I have timed the two following pieces of code. As I mainly wanted to focus on the speed of searchsorted vs digitize I pared down Jamie's code a bit. The relevant chunk is here:

a = np.arange(size_a)[::-1]
b = np.random.randint(0, size_a, size_b)

ja = np.take(a, np.searchsorted(a, b, side='right', sorter=a)-1)

#Compare to digitize
if ~np.allclose(ja,np.digitize(b,a)):
    print 'Comparison failed'

timing_digitize[num_a,num_b] = timeit.timeit('np.digitize(b,a)',
                      'import numpy as np; from __main__ import a, b',
                      number=3)
timing_searchsorted[num_a,num_b] = timeit.timeit('np.take(a, np.searchsorted(a, b, side="right", sorter=a)-1)',
                      'import numpy as np; from __main__ import a, b',
                      number=3)

This is a bit beyond my limited matplotlib ability so this is done in DataGraph. I have plotted the logarithmic ratio of timing_digitize/timing_searchsorted so values greater then zero searchsorted is faster and values less then zero digitize is faster. The colors also give relative speeds. For example is shows that in the top right (a = 1E6, b=1E6) digitize is ~300 times slower then searchsorted while for smaller sizes digitize can be up to 10x faster. The black line is roughly the break even point:

enter image description here Looks like for raw speed searchsorted is almost always faster for large cases, but the simple syntax of digitize is nearly as good if the number of bins is small.

edited Sep 18, 2013 at 22:48

answered Sep 18, 2013 at 15:23

Daniel

19.6k7 gold badges64 silver badges74 bronze badges

Sign up to request clarification or add additional context in comments.

5 Comments

askewchan Over a year ago

+1 - I think this is correct for all cases specified by the question: "I need the indices of the first less than or equal occurrence of elements of one array in another array."

Jaime Over a year ago

I am not sure about np.digitize, but np.searchsorted does a binary search, so its strength should be large a arrays, rather than large b arrays. Setting a = np.arange(1000)[::-1] and b = np.random.randint(0, 1000, 1E6) and timing seems to confirms this.

Daniel Over a year ago

@Jamie Thats what I discovered also, so I temporarily removed timings for your example- it was not really fair to compare the two without more examples then what I wanted to go into.

Daniel Over a year ago

@Jamie updated the question with some relevant timings, let me know if you see any obvious flaws. I left np.take in as its scaling is likely N while searchsorted would be the dominant N log(N) operation.

Jaime Over a year ago

I wish I could give you another +1 for the wonderful time analysis!

Jaime · Accepted Answer · 2013-09-18 15:44:22Z

6

This is messy, but it works:

>>> idx = np.argsort(a)
>>> np.take(idx, np.searchsorted(a, b, side='right', sorter=idx)-1)
array([0, 1, 1, 1, 2, 2, 2, 2, 2, 3], dtype=int64)

If your array is always sorted, you should be able to get rid of the argsort call.

answered Sep 18, 2013 at 15:44

Jaime

67.7k19 gold badges128 silver badges164 bronze badges

Collectives™ on Stack Overflow

Comparing two numpy arrays of different length

2 Answers 2

5 Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

5 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related