0

I have a python for loop over a string, that have to decide char-by-char if the current char should stay or be removed.

Let's say I have a function fun(character,state) which receives a char, and some state parameter, and returns True if the char should be kept, and a new state. for example:

def fun(c,state):
    if state%5==0:
        return False, state+1
    else:
        return True, state+2

I have two ideas how to iterate (sequentially) over the string with this function, but I can't decide between them (or others) in terms of runtime complexity and space usage tradeoff:

Option 1:

def option1(string):
    state = 0
    i =0
    while i<len(string):
        answer, state = fun(string[i],state)
        if not answer:
             string = string[:i]+string[i+1:]
        else:
            i+=1
    return string

Option 2:

def option2(string):
    result  = ''
    state = 0
    for c in string:
        answer, state = fun(c,state)
        if answer:
             result+=c
   return result

Does anyone has a better approach/ solution?


edit: some timing results:

on a long string ( 67976 chars) the results are:

  • option1 - 205 ms
  • option2 - 46 ms

looks like eliminating the i'th char with string[:i]+string[i+1:] is not a good idea.

5
  • You can just measure the performance and see what happens. Your best bet in making this faster (if it matters) is avoiding the function call per character altogether. Commented Jun 29, 2017 at 13:36
  • I'm looking for a theoretical point of view, since just timing won't find a better approach. Commented Jun 29, 2017 at 13:39
  • I'm not sure what the theoretical differences here are. You have to iterate over the entire string and apply your condition. The rest are implementation details that are very much a matter of measurement. They'll also tend to dominate the running time (i.e. how many function calls you make, how many new strings you instantiate and instantly turn to garbage, etc). Commented Jun 29, 2017 at 13:46
  • Your option 1 may be slow because you are calculating len(string) every loop. Commented Jun 29, 2017 at 14:21
  • Now that you're measuring things, try taking out the function call as an experiment - you'll see it completely dominates the time taken, making your option2 (and the one in the join answer below) nearly two times faster. Commented Jun 29, 2017 at 14:51

2 Answers 2

3

String concatenation with += is not recommended. Better use a list and join it.

def option2(string):
    result  = []
    state = 0
    for c in string:
        answer, state = fun(c, state)
        if answer:
             result.append(c)
   return ''.join(result)

This a good explanation why. While += is optimized in CPython (for certain cases?), it can be really slow in other implementations such as PyPy, for example.

Sign up to request clarification or add additional context in comments.

1 Comment

indeed, the time elapsed is 38 ms - the best by now
1

In CPython, a good performance rule of thumb is 'avoid function calls in tight loops'. The overhead tends to take a significant amount of time compared to simple iterative computations. In your case, a 'brute force' approach is to simply inline the logic in the loop that traverses your string:

def option4(strng):
    result  = []
    state = 0
    for c in strng:
        if state % 5 == 0:
            state += 1
        else:
            result.append(c)
            state += 2
    return ''.join(result)

This may feel a little clunky, a more 'pythonic' approach might be to use a python generator that takes an iterable and initial state as input and yields the matching members.

def filtergen(iter, initstate):
    state = initstate
    for c in iter:
        if state % 5 == 0:
            state += 1
        else:
            state += 2
            yield c

def option5(strng):
    return ''.join(filtergen(strng, 0))

This reduces call overhead - you only need as many calls as matching elements and as a side benefit, frees the caller from the need to keep track of state. Additionally, you no longer need the intermediate list. The filter remains generic - it works on any kind of iterable, rather than just strings.

Putting all of these together and measuring performance:

s = 'T0kzcZ0x8VQY8ulgFKU8D1MlIUULRsVsNZMnXbjliUEES6sEIVUzpjxlHLG59z' * 1200

def fun(c, state):
    if state % 5 == 0:
        return False, state + 1
    else:
        return True, state + 2

def option3(strng):
    result  = []
    state = 0
    for c in strng:
        answer, state = fun(c, state)
        if answer:
             result.append(c)
    return ''.join(result)

def option4(strng):
    result  = []
    state = 0
    for c in strng:
        if state % 5 == 0:
            state += 1
        else:
            result.append(c)
            state += 2
    return ''.join(result)

def filtergen(iter, initstate):
    state = initstate
    for c in iter:
        if state % 5 == 0:
            state += 1
        else:
            state += 2
            yield c

def option5(strng):
    return ''.join(filtergen(strng, 0))

import timeit
print("string length is " + str(len(s)))
print(timeit.timeit('option3(s)', globals=globals(), number=100))
print(timeit.timeit('option4(s)', globals=globals(), number=100))
print(timeit.timeit('option5(s)', globals=globals(), number=100))

print(option3(s) == option(4) == option5(s))

On my dinky laptop, the output looks like this (times are in seconds):

pvg /tmp ➤  python3 s.py
string length is 74400
2.488818967998668
1.3291878789968905
1.268602224998176
True

The 'inline logic' and generator both easily outperform the naive call-per-element approach.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.