
From my profiling, I can see this function takes most of the processing time. How do I speed this code up? My dataset has more than a million records, and the stopword list given here is just a sample: the real one contains 150 words.

import re

def remove_if_name_v1(s):
    stopwords = ('western spring','western sprin','western spri','western spr','western sp','western s',
                 'grey lynn','grey lyn','grey ly','grey l')
    for word in stopwords:
        s = re.sub(r'(' + word + r'.*?|.*?)\b' + word + r'\b', r'\1', s.lower(), 1)
    return s.title()

test.new_name = test.old_name.apply(lambda x: remove_if_name_v1(x) if pd.notnull(x) else x)

It seems the function runs for each row in the data frame, and within each row the for loop runs once per stop word. Is there an alternative approach?

What I am trying to do here is, for example: if the string contains "western spring road western spring", the function should return "western spring road".

Thanks.

  • I assume there are a million 'test's and 150 'stopwords'. In that case, you can precompile the regex of the 150 stopwords. Commented Jul 14, 2017 at 1:02
  • Cool. How do I do that and apply it here? Commented Jul 14, 2017 at 1:03
  • @ds_user can you post some more input/output samples? I am suspicious of your stopwords list. You try to substitute e.g. both (western s.*?) and (western spring.*?), where the first case covers the second. Commented Jul 14, 2017 at 1:13
  • @ds_user, also I think .*? is redundant, and can be changed to just .*. Commented Jul 14, 2017 at 1:14
  • 1
    You may want to validate that the stop words do not contain special characters like '.,*,{},(),[]' as those can interfere with regex matching Commented Jul 14, 2017 at 1:24
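The escaping concern in the last comment can be sketched like this (the stop word with a dot is hypothetical, purely for illustration):

```python
import re

# Without escaping, '.' is a regex wildcard, so a raw stop word over-matches.
print(bool(re.search('grey l.', 'grey lynn')))             # '.' happily matches 'y'

# re.escape turns metacharacters into literals before building the pattern.
print(bool(re.search(re.escape('grey l.'), 'grey lynn')))  # now a literal '.' is required
```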

2 Answers


You can combine and pre-compile the regex for a fairly big improvement.

import re

stopwords = ('western spring',
             'western sprin',
             'western spri',
             'western spr',
             'western sp',
             'western s',
             'grey lynn',
             'grey lyn',
             'grey ly',
             'grey l')

pat = re.compile(r'(?P<repl>(?P<word>{stopwords}).*?|.*?)\b(?P=word)\b'.format(
                 stopwords='|'.join(re.escape(s) for s in stopwords)))

test.old_name.str.replace(pat, r'\g<repl>')

Note the (?P=word) back-reference. I've also used Series.str.replace instead of Series.apply, which is slightly cleaner.
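For instance, here is a quick check of the combined, precompiled pattern on the question's sample string (stopword list trimmed for brevity):

```python
import re

stopwords = ('western spring', 'western sprin', 'grey lynn', 'grey l')  # trimmed sample

pat = re.compile(r'(?P<repl>(?P<word>{stopwords}).*?|.*?)\b(?P=word)\b'.format(
    stopwords='|'.join(re.escape(s) for s in stopwords)))

# (?P=word) matches a second occurrence of whichever stop phrase matched
# first; the replacement keeps only what was captured before it.
print(pat.sub(r'\g<repl>', 'western spring road western spring').strip())
# -> western spring road
```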


10 Comments

Not only is it cleaner, but likely Series.str.replace is faster than apply because it is vectorized.
@SethMMorton, in my experiments, it actually wasn't much faster. I think the implementation of str.replace is fairly close to apply(lambda s: ...), since the regex engine is Python-side.
You need a list or tuple instead of set if you want the largest match
I see. Well, at least it encourages the good pandas practice of using vectorized methods instead of hand-rolled functions when possible, since apply on a DataFrame is definitely slow.
To clarify @balki's comment: a set loses the order of the stopwords, so a shorter word may be chosen if it happens to come first. Using a list (or tuple) corrects that.
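The ordering point can be seen directly with regex alternation, which is tried left to right (sample phrases from the question):

```python
import re

# Longer phrase first: the full phrase wins.
longest_first = re.compile('grey lynn|grey l')
print(longest_first.match('grey lynn').group())   # grey lynn

# Shorter phrase first: the prefix wins, truncating the match.
shortest_first = re.compile('grey l|grey lynn')
print(shortest_first.match('grey lynn').group())  # grey l
```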

One quick improvement is to put the stop words in a set: membership checks then become constant-time, O(1), lookups.

STOP_WORDS = {
    'western spring',
    'western sprin',
    'western spri',
    'western spr',
    'western sp',
    'western s',
    'grey lynn',
    'grey lyn',
    'grey ly',
    'grey l'
}

def find_first_stop(words):
    # `words` holds the tail of the name in reverse order; rejoin it and
    # test whether the accumulated phrase is a stop phrase.
    return ' '.join(reversed(words)) in STOP_WORDS

def remove_if_name_v1(s):
    # A name that is itself a stop phrase is kept as-is.
    if s in STOP_WORDS:
        return s

    words = []
    split_words = s.split(' ')
    for word in reversed(split_words):
        words.append(word)
        # Discard the accumulated tail once it forms a stop phrase.
        if find_first_stop(words):
            words = []
    return ' '.join(reversed(words))

old_name = pd.Series(['western spring road western spring', 'kings road western spring', 'western spring'])
new_name = old_name.apply(lambda x: remove_if_name_v1(x) if pd.notnull(x) else x)
print(new_name)

Output:

0    western spring road
1             kings road
2         western spring
dtype: object

7 Comments

The set is a good idea, but I suspect this doesn't work as expected: it would return just "road" for the input "western spring road western spring". What I am trying to do is remove only the second occurrence and keep everything before it.
Ok, I see what you want. What is 's'? It could be split into words, passed through the filter one by one, and rejoined if each is not a stop word.
The problem with the splitting approach is that we can't catch multi-word phrases once we split. 's' is a string, for example "western spring road western spring". If we split it, we can't catch "western spring".
I get it. It's a tricky problem because your stop words are phrases, not just single words. I'm thinking of a fast solution and will come back with it.
@ds_user I wrote this algorithm, which passes your test, but I'm not sure how fast it is.
