From my profiling, I can see that this function takes most of the processing time. How do I speed up this code? My dataset has more than a million records, and the stopword list shown here is just a sample; the real list contains about 150 words.
import re

def remove_if_name_v1(s):
    stopwords = ('western spring', 'western sprin', 'western spri', 'western spr',
                 'western sp', 'western s',
                 'grey lynn', 'grey lyn', 'grey ly', 'grey l')
    for word in stopwords:
        # Remove a later whole-word repeat of the stopword, keeping the first occurrence.
        s = re.sub(r'(' + word + r'.*?|.*?)\b' + word + r'\b', r'\1', s.lower(), 1)
    return s.title()
test['new_name'] = test['old_name'].apply(lambda x: remove_if_name_v1(x) if pd.notnull(x) else x)
It seems the function is run for every row of the data frame, and within each row the for loop runs once per stopword. Is there an alternative approach?
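One direction I have been sketching, purely as an untested idea: build a single compiled pattern once (longest stopwords first) with a backreference to catch the repeated word, then apply it per row instead of looping over all 150 stopwords for every row. The name remove_if_name_v2 and the combined pattern below are my own; I am assuming the goal is to keep the first occurrence of a stopword and drop a later whole-word repeat.

import re
import pandas as pd

# Sample of the real ~150 stopwords; longest first so the alternation prefers full matches.
STOPWORDS = ('western spring', 'western sprin', 'western spri', 'western spr',
             'western sp', 'western s',
             'grey lynn', 'grey lyn', 'grey ly', 'grey l')

# One compiled pattern: a stopword, anything in between, then a whole-word repeat of that stopword.
_alts = '|'.join(re.escape(w) for w in sorted(STOPWORDS, key=len, reverse=True))
_dup = re.compile(r'\b(' + _alts + r')\b(.*?)\s*\b\1\b')

def remove_if_name_v2(s):
    # e.g. 'western spring road western spring' -> 'western spring road'
    return _dup.sub(r'\1\2', s.lower()).title()

test['new_name'] = test['old_name'].apply(lambda x: remove_if_name_v2(x) if pd.notnull(x) else x)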
What I am trying to do here, as an example: if the string contains "western spring road western spring", the function returns "western spring road".
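For reference, a quick check of that example against the function above (the result keeps whatever the capture group grabbed, so a trailing space may survive):

print(remove_if_name_v1('western spring road western spring'))
# -> 'Western Spring Road ' (title-cased, with a trailing space left by the capture)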
Thanks.
Comment: between (western s.*?) and (western spring.*?), the first case covers the second. Also, .*? is redundant and can be changed to just .*.
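A tiny illustration of that coverage point (the example string is my own): anything the longer prefix branch matches, the shorter one matches too.

import re

# Both branches match the same input; the shorter prefix subsumes the longer one.
print(bool(re.match(r'western s.*', 'western spring road')))        # True
print(bool(re.match(r'western spring.*', 'western spring road')))   # True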