Efficient reuse of iterables in python

Question

Say I have an iterator that returns 100 values. At some unpredictable point, the values switch from one category to another (in this case, ints to strs), and the action performed needs to change too (updating a variable to printing the number of characters).

The most basic example code I guess would be this:

def perform_operations(iterable):
    most_recent_int = None
    for item in iterable:
        if type(item) is int:
            most_recent_int = item
        else:
            print(f"int of {len(item)} characters")
    return most_recent_int

iterable = list(range(100, 140)) + [str(i) for i in range(60)]
print(f"most recent int: {perform_operations(iterable)}")

This seems efficient, but that if statement also seems wasted on the last 59 or 60 entries as we can be sure the type of the items will not revert to int.

Here's a variant where two different suites are used, so that the last entries won't need an if statement:

def perform_operations(iterable):
    most_recent_int = None
    for item in (iterator := iter(iterable)):
        if type(item) is int:
            most_recent_int = item
        else:
            print(f"int of {len(item)} characters")
            break
    for remaining_str_item in iterator:
        print(f"int of {len(remaining_str_item)} characters")
    return most_recent_int

iterable = list(range(100, 140)) + [str(i) for i in range(60)]
print(f"most recent int: {perform_operations(iterable)}")

This looks like it should be more efficient.

The only thing I can think of that might perform fewer operations and possibly be interpreted quicker is:

def perform_operations(iterable):
    most_recent_int = None
    for item in (iterator := iter(iterable)):
        if type(item) is int:
            most_recent_int = item
        else:
            break
    while True:
        print(f"int of {len(item)} characters")
        try:
            item = next(iterator)
        except StopIteration:
            break
    return most_recent_int

iterable = list(range(100, 140)) + [str(i) for i in range(60)]
print(f"most recent int: {perform_operations(iterable)}")

Or maybe:

from itertools import islice

def perform_operations(iterable):
    most_recent_int = None
    for i, item in enumerate(iterable):
        if type(item) is int:
            most_recent_int = item
        else:
            break
    for remaining_str_item in islice(iterable, i, None):
        print(f"int of {len(remaining_str_item)} characters")
    return most_recent_int

iterable = list(range(100, 140)) + [str(i) for i in range(60)]
print(f"most recent int: {perform_operations(iterable)}")

Interested in the most efficient solutions, but also idiomatic and precedented solutions.

"that if statement also seems wasted on the last 59 or 60 entries" Let branch prediction take care for that. Don't fall into the premature optimization trap. Use profiling to see if that is really a bottleneck. If not, move on — DeepSpace
– DeepSpace, Commented Mar 23, 2021 at 16:27
As I understand it, branch prediction won't stop the if statement being evaluated (even if only to verify the predicted branch is correct), which means more unnecessary processing for the last entries. This would be more of an issue with a less simplified version involving an expensive if statement. — to7m
– to7m, Commented Mar 23, 2021 at 17:32

donkopotamus · Accepted Answer · 2021-03-23 22:08:31Z

In the example case, where the conditional type check is very fast, there is absolutely no point to a more complex solution.

However, let's suppose you have an iterable where it is costly to discriminate between the two categories X and Y, in which case the most efficient solution is one that attempts discrimination the least. Suppose:

we are not memory constrained and can freeze the iterable into a list L; and
we have a predicate is_y that returns True if the value is in category Y and False otherwise

Then you could consider your frozen L to be a list that is sorted under p. In which case the problem of determining where the category changes from X to Y reduces to a bisection search.

So ... in the upcoming Python 3.10, where bisect.bisect finally gains a key argument:

# freeze the iterable
frozen = list(iterable)

# find the position such that frozen[:pos] are all category X
pos = bisect.bisect_left(frozen, key=is_y)

# then proceed to deal with each half as you wish
for x in frozen[:pos]:
    do_x(x)

for y in frozen[pos:]:
    do_y(y)

Prior to Python 3.10, bisect does not support a key argument. In that case you can simply wrap each object to support __lt__, where that implementation simply evaluates the inverted predicate on the wrapped object.

class wrapper:
    def __init__(self, underlying):
        self.underlying = underlying

    def __lt__(self, other):
        return not is_y(self.underlying)

If you are memory constrained, for example the iterable is potentially very large, then this approach can still be used by iteratively:

reading into a fixed size buffer;
attempting a bisect_left on that buffer, if the returned value is len(buffer) then you have not yet reached the point where the category switches.

Caveat

This situation seems extremely unlikely in practice, nearly always category discrimination, either via if or try: ... except ... is cheap and worrying about it is definitely prematurely optimising code to make it un-readable and hard to reason about.

I like this solution. My example is simplified, but the original issues involve large collections that would certainly benefit from bisecting. Plus some of the iterables are accessible by index which help. I didn't even consider bisecting, but it might make things significantly easier to read. Thanks!

Collectives™ on Stack Overflow

Efficient reuse of iterables in python

1 Answer 1

Caveat

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Caveat

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related