
From my profiling, I can see this function takes most of the processing time. How do I speed this code up? My dataset has more than a million records, and the stopword list given here is just a sample: the real one contains 150 words.

import re

def remove_if_name_v1(s):
    stopwords = ('western spring','western sprin','western spri','western spr','western sp','western s',
                 'grey lynn','grey lyn','grey ly','grey l')
    for word in stopwords:
        s = re.sub(r'(' + word + r'.*?|.*?)\b' + word + r'\b', r'\1', s.lower(), 1)
    return s.title()

test.new_name = test.old_name.apply(lambda x: remove_if_name_v1(x) if pd.notnull(x) else x)

It seems the function runs for each row in the data frame, and within each row the for loop runs once per stop word. Is there an alternative approach?

What I am trying to do here is, for example: if the string contains "western spring road western spring", the function should return "western spring road".

Thanks.

  • I assume there are a million 'test's and 150 'stopwords'. In that case, you can precompile the regex of the 150 stopwords. Commented Jul 14, 2017 at 1:02
  • Cool. How do I do that and apply it here? Commented Jul 14, 2017 at 1:03
  • @ds_user can you post some more input/output samples? I am suspicious of your stopwords list. You try to substitute e.g. both (western s.*?) and (western spring.*?), where the first case covers the second. Commented Jul 14, 2017 at 1:13
  • @ds_user, also I think .*? is redundant, and can be changed to just .*. Commented Jul 14, 2017 at 1:14
  • 1
    You may want to validate that the stop words do not contain special characters like '.,*,{},(),[]' as those can interfere with regex matching Commented Jul 14, 2017 at 1:24
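The escaping concern in the last comment can be sketched like this (the stop word with a dot is hypothetical, purely for illustration):

```python
import re

# Without escaping, '.' is a regex wildcard, so a raw stop word over-matches.
print(bool(re.search('grey l.', 'grey lynn')))             # '.' happily matches 'y'

# re.escape turns metacharacters into literals before building the pattern.
print(bool(re.search(re.escape('grey l.'), 'grey lynn')))  # now a literal '.' is required
```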

2 Answers


You can combine and pre-compile the regex for a fairly big improvement.

import re

stopwords = ('western spring',
             'western sprin',
             'western spri',
             'western spr',
             'western sp',
             'western s',
             'grey lynn',
             'grey lyn',
             'grey ly',
             'grey l')

pat = re.compile(r'(?P<repl>(?P<word>{stopwords}).*?|.*?)\b(?P=word)\b'.format(
                 stopwords='|'.join(re.escape(s) for s in stopwords)))

test.old_name.str.replace(pat, r'\g<repl>')

Note the (?P=word) back-reference. I've also used Series.str.replace instead of Series.apply, which is slightly cleaner.
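For instance, here is a quick check of the combined, precompiled pattern on the question's sample string (stopword list trimmed for brevity):

```python
import re

stopwords = ('western spring', 'western sprin', 'grey lynn', 'grey l')  # trimmed sample

pat = re.compile(r'(?P<repl>(?P<word>{stopwords}).*?|.*?)\b(?P=word)\b'.format(
    stopwords='|'.join(re.escape(s) for s in stopwords)))

# (?P=word) matches a second occurrence of whichever stop phrase matched
# first; the replacement keeps only what was captured before it.
print(pat.sub(r'\g<repl>', 'western spring road western spring').strip())
# -> western spring road
```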


10 Comments

Not only is it cleaner, but likely Series.str.replace is faster than apply because it is vectorized.
@SethMMorton, in my experiments, it actually wasn't much faster. I think the implementation of str.replace is fairly close to apply(lambda s: ...), since the regex engine is Python-side.
You need a list or tuple instead of set if you want the largest match
I see. Well, at least it encourages the good pandas practice of using vectorized methods instead of hand-rolled functions when possible, since apply on a DataFrame is definitely slow.
To clarify @balki's comment: a set loses the order of the stopwords, so a shorter word may be chosen if it happens to come first. Using a list (or tuple) corrects that.
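The ordering point can be seen directly with regex alternation, which is tried left to right (sample phrases from the question):

```python
import re

# Longer phrase first: the full phrase wins.
longest_first = re.compile('grey lynn|grey l')
print(longest_first.match('grey lynn').group())   # grey lynn

# Shorter phrase first: the prefix wins, truncating the match.
shortest_first = re.compile('grey l|grey lynn')
print(shortest_first.match('grey lynn').group())  # grey l
```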

One quick improvement is to put the stop words in a set: membership checks then become constant-time, O(1), lookups.

STOP_WORDS = {
    'western spring',
    'western sprin',
    'western spri',
    'western spr',
    'western sp',
    'western s',
    'grey lynn',
    'grey lyn',
    'grey ly',
    'grey l'
}

def find_first_stop(words):
    # `words` holds the tail of the name in reverse order; rejoin it and
    # test whether the accumulated phrase is a stop phrase.
    return ' '.join(reversed(words)) in STOP_WORDS

def remove_if_name_v1(s):
    # A name that is itself a stop phrase is kept as-is.
    if s in STOP_WORDS:
        return s

    words = []
    split_words = s.split(' ')
    for word in reversed(split_words):
        words.append(word)
        # Discard the accumulated tail once it forms a stop phrase.
        if find_first_stop(words):
            words = []
    return ' '.join(reversed(words))

old_name = pd.Series(['western spring road western spring', 'kings road western spring', 'western spring'])
new_name = old_name.apply(lambda x: remove_if_name_v1(x) if pd.notnull(x) else x)
print(new_name)

Output:

0    western spring road
1             kings road
2         western spring
dtype: object

7 Comments

The set is a good idea, but I suspect this doesn't work as expected: it would return just "road" for the input "western spring road western spring". What I am trying to do is remove only the second occurrence and keep everything before it.
Ok, I see what you want. What is 's'? It could be split into words, passed through the filter one by one, and rejoined if each is not a stop word.
The problem with the splitting approach is that we can't catch multi-word phrases once we split. 's' is a string, for example "western spring road western spring". If we split it, we can't catch "western spring".
I get it. It's a tricky problem because your stop words are phrases, not just single words. I'm thinking of a fast solution and will come back with it.
@ds_user I wrote this algorithm, which passes your test, but I'm not sure how fast it is.
