7

I want to search a sorted list of strings for all of the elements that start with a given substring.

Here's an example that finds all of the exact matches:

import bisect
names = ['adam', 'bob', 'bob', 'bob', 'bobby', 'bobert', 'chris']
names.sort()
leftIndex = bisect.bisect_left(names, 'bob')
rightIndex = bisect.bisect_right(names, 'bob')
print(names[leftIndex:rightIndex])

Which prints ['bob', 'bob', 'bob'].

Instead, I want to search for all the names that start with 'bob'. The output I want is ['bob', 'bob', 'bob', 'bobby', 'bobert']. If I could modify the comparison method of the bisect search, then I could use name.startswith('bob') to do this.

As an example, in Java it would be easy. I would use:

Arrays.binarySearch(names, "bob", myCustomComparator);

where 'myCustomComparator' is a comparator that takes advantage of the startswith method (and some additional logic).

How do I do this in Python?

Depending on your needs, you might be able to use a trie data structure. (Commented Sep 11, 2011 at 20:47)
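To make the trie suggestion concrete, here is a minimal sketch. The names TrieNode, trie_insert and trie_with_prefix are made up for illustration and do not come from any library; the sketch assumes the whole list fits in memory and that duplicates should be reported once per occurrence.

```python
class TrieNode:
    """One node per character; `count` records how many words end here."""
    def __init__(self):
        self.children = {}
        self.count = 0


def trie_insert(root, word):
    # Walk (and create) one child node per character of the word.
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.count += 1


def trie_with_prefix(root, prefix):
    # Descend to the node that represents the prefix, if it exists.
    node = root
    for ch in prefix:
        node = node.children.get(ch)
        if node is None:
            return []
    # Collect every word stored at or below that node.
    results = []
    stack = [(node, prefix)]
    while stack:
        current, word = stack.pop()
        results.extend([word] * current.count)
        for ch, child in current.children.items():
            stack.append((child, word + ch))
    return sorted(results)


root = TrieNode()
for name in ['adam', 'bob', 'bob', 'bob', 'bobby', 'bobert', 'chris']:
    trie_insert(root, name)
matches = trie_with_prefix(root, 'bob')
```

A trie trades memory for lookups whose cost depends on the prefix length and the number of matches rather than on the size of the list, so it mostly pays off when many prefix queries run against the same data.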

5 Answers

8

bisect can be fooled into using a custom comparison by passing it an instance whose class implements the comparator of your choosing:

>>> class PrefixCompares(object):
...     def __init__(self, value):
...         self.value = value
...     def __lt__(self, other):
...         return self.value < other[0:len(self.value)]
... 
>>> import bisect
>>> names = ['adam', 'bob', 'bob', 'bob', 'bobby', 'bobert', 'chris']
>>> names.sort()
>>> key = PrefixCompares('bob')
>>> leftIndex = bisect.bisect_left(names, key)
>>> rightIndex = bisect.bisect_right(names, key)
>>> print(names[leftIndex:rightIndex])
['adam', 'bob', 'bob', 'bob', 'bobby', 'bobert']
>>> 

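D'oh. The right bisect worked, but the left one obviously didn't: "adam" is not prefixed with "bob"! bisect_left evaluates names[mid] < key, which never reaches PrefixCompares.__lt__ (the output above is from Python 2; on Python 3 that comparison raises TypeError instead). To fix it, you have to adapt the sequence, too.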

>>> class HasPrefix(object):
...     def __init__(self, value):
...         self.value = value
...     def __lt__(self, other):
...         return self.value[0:len(other.value)] < other.value
... 
>>> class Prefix(object):
...     def __init__(self, value):
...         self.value = value
...     def __lt__(self, other):
...         return self.value < other.value[0:len(self.value)]
... 
>>> class AdaptPrefix(object):
...     def __init__(self, seq):
...         self.seq = seq
...     def __getitem__(self, key):
...         return HasPrefix(self.seq[key])
...     def __len__(self):
...         return len(self.seq)
... 
>>> import bisect
>>> names = ['adam', 'bob', 'bob', 'bob', 'bobby', 'bobert', 'chris']
>>> names.sort()
>>> needle = Prefix('bob')
>>> haystack = AdaptPrefix(names)
>>> leftIndex = bisect.bisect_left(haystack, needle)
>>> rightIndex = bisect.bisect_right(haystack, needle)
>>> print(names[leftIndex:rightIndex])
['bob', 'bob', 'bob', 'bobby', 'bobert']
>>> 

3 Comments

This won't work in all applications. It works for bisect_right(names, key) because the bisect_right code tests key < names[mid]. However, the custom comparison won't be used for bisect_left(names, key) because the bisect_left code tests names[mid] < key. You get the right result because the default behavior of bisect_left happens to return the desired result.
If you implement the __eq__() method and add the @total_ordering decorator (from functools), then all the other comparisons (in the same order) will work too. See Total Ordering.
@total_ordering is not available in all versions of Python, and it is also not necessary for this example. A related link is the ActiveState recipe for the same thing.
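As a rough illustration of the @total_ordering idea from the comments above, the sketch below compares a key object directly against the plain strings in the list. The class name PrefixKey is hypothetical, and treating "equal" as "starts with the prefix" is an assumption made for this example, not something the answer specifies.

```python
import bisect
from functools import total_ordering


@total_ordering
class PrefixKey:  # hypothetical name, for illustration only
    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        # "Equal" here means: the string starts with the prefix.
        return self.value == other[:len(self.value)]

    def __lt__(self, other):
        # The prefix sorts before the string's leading characters.
        return self.value < other[:len(self.value)]


names = sorted(['adam', 'bob', 'bob', 'bob', 'bobby', 'bobert', 'chris'])
key = PrefixKey('bob')
# bisect_left evaluates names[mid] < key, which falls back to the
# __gt__ that total_ordering derives from __lt__ and __eq__.
left = bisect.bisect_left(names, key)
# bisect_right evaluates key < names[mid], which uses __lt__ directly.
right = bisect.bisect_right(names, key)
matches = names[left:right]
```

With the derived __gt__ in place, the unadapted list of strings can be searched from both sides, so no wrapper around the sequence is needed in this variant.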
4

Unfortunately, bisect does not allow you to specify a key function. What you can do, though, is append '\xff\xff\xff\xff' to the string before using it to find the highest index, then take the elements between the two indices.
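A minimal sketch of that trick follows. The helper name prefix_range_sentinel is made up for this example. It assumes no name continues the prefix with characters that sort after the sentinel; on Python 3, '\xff' is the code point U+00FF, so names containing higher code points right after the prefix could be missed unless a larger sentinel is passed in.

```python
import bisect


def prefix_range_sentinel(sorted_names, prefix, sentinel='\xff\xff\xff\xff'):
    """Return (lo, hi) so that sorted_names[lo:hi] starts with prefix.

    Assumes no name continues the prefix with text that sorts after
    the sentinel.
    """
    # The first element that is >= prefix starts the range.
    lo = bisect.bisect_left(sorted_names, prefix)
    # prefix + sentinel sorts after every ordinary continuation of prefix.
    hi = bisect.bisect_right(sorted_names, prefix + sentinel)
    return lo, hi


names = sorted(['adam', 'bob', 'bob', 'bob', 'bobby', 'bobert', 'chris'])
lo, hi = prefix_range_sentinel(names, 'bob')
matches = names[lo:hi]
```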

2 Comments

Very clever solution. I'll wait to see if anyone posts something more robust before accepting this.
Since Python 3.10, the bisect functions have a key argument docs.python.org/3/library/bisect.html
4

As an alternative to IfLoop's answer - why not use the __gt__ built-in?

>>> class PrefixCompares(object):
...     def __init__(self, value):
...         self.value = value
...     def __lt__(self, other):
...         return self.value < other[0:len(self.value)]
...     def __gt__(self, other):
...         # Reflected form of `other < self`, which is what bisect_left evaluates
...         return self.value > other
... 
>>> import bisect
>>> names = ['adam', 'bob', 'bob', 'bob', 'bobby', 'bobert', 'chris']
>>> names.sort()
>>> key = PrefixCompares('bob')
>>> leftIndex = bisect.bisect_left(names, key)
>>> rightIndex = bisect.bisect_right(names, key)
>>> print(names[leftIndex:rightIndex])
['bob', 'bob', 'bob', 'bobby', 'bobert']


1

Coming from a functional programming background, I'm flabbergasted that there's no common binary search abstraction to which you can supply custom comparison functions.

To prevent myself from duplicating that thing over and over again or using gross and unreadable OOP hacks, I've simply written an equivalent of the Arrays.binarySearch(names, "bob", myCustomComparator); function you mentioned:

class BisectRetVal():
    LOWER, HIGHER, STOP = range(3)

def generic_bisect(arr, comparator, lo=0, hi=None): 
    if lo < 0:
        raise ValueError('lo must be non-negative')
    if hi is None:
        hi = len(arr)
    while lo < hi:
        mid = (lo + hi) // 2
        result = comparator(arr, mid)  # call the comparator once per step
        if result == BisectRetVal.STOP:
            return mid
        elif result == BisectRetVal.HIGHER:
            lo = mid + 1
        else:
            hi = mid
    return lo

That was the generic part. And here are the specific comparators for your case:

def string_prefix_comparator_right(prefix):
    def parametrized_string_prefix_comparator_right(array, mid):
        if array[mid][0:len(prefix)] <= prefix:
            return BisectRetVal.HIGHER
        else:
            return BisectRetVal.LOWER
    return parametrized_string_prefix_comparator_right


def string_prefix_comparator_left(prefix):
    def parametrized_string_prefix_comparator_left(array, mid):
        if array[mid][0:len(prefix)] < prefix: # < is the only diff. from right
            return BisectRetVal.HIGHER
        else:
            return BisectRetVal.LOWER
    return parametrized_string_prefix_comparator_left

Here's the code snippet you provided adapted to this function:

>>> names = ['adam', 'bob', 'bob', 'bob', 'bobby', 'bobert', 'chris']
>>> names.sort()
>>> leftIndex = generic_bisect(names, string_prefix_comparator_left("bob"))
>>> rightIndex = generic_bisect(names, string_prefix_comparator_right("bob"))
>>> names[leftIndex:rightIndex]
['bob', 'bob', 'bob', 'bobby', 'bobert']

It works unaltered in both Python 2 and Python 3.

For more info on how this works and more comparators for this thing check out this gist: https://gist.github.com/Shnatsel/e23fcd2fe4fbbd869581


0

Here's a solution that hasn't been offered yet: re-implement the binary search algorithm.

This should usually be avoided because you're repeating code (and binary search is easy to mess up), but it seems there's no nice solution.

bisect_left() already gives the desired result, so we only need to change bisect_right(). Here's the original implementation for reference:

def bisect_right(a, x, lo=0, hi=None):
    if lo < 0:
        raise ValueError('lo must be non-negative')
    if hi is None:
        hi = len(a)
    while lo < hi:
        mid = (lo+hi)//2
        if x < a[mid]: hi = mid
        else: lo = mid+1
    return lo

And here's the new version. The only changes are the added condition "and not a[mid].startswith(x)" and the new name, bisect_right_prefix:

def bisect_right_prefix(a, x, lo=0, hi=None):
    if lo < 0:
        raise ValueError('lo must be non-negative')
    if hi is None:
        hi = len(a)
    while lo < hi:
        mid = (lo+hi)//2
        if x < a[mid] and not a[mid].startswith(x): hi = mid
        else: lo = mid+1
    return lo

Now the code looks like this:

import bisect
names = ['adam', 'bob', 'bob', 'bob', 'bobby', 'bobert', 'chris']
names.sort()
leftIndex = bisect.bisect_left(names, 'bob')
rightIndex = bisect_right_prefix(names, 'bob')
print(names[leftIndex:rightIndex])

Which produces the expected result:

['bob', 'bob', 'bob', 'bobby', 'bobert']

What do you think, is this the way to go?

1 Comment

Would be much more generic if the bisect function simply accepted a custom comparator as a function
