0

I have a list of strings and I want to remove the stop words from each string. The thing is, the stopword list is much longer than the strings, and I don't want to repeatedly compare each string against the whole stopword list. Is there a way in Python to process these multiple strings at the same time?

lis = ['aka', 'this is a good day', 'a pretty dog']
stopwords = []  # pretty long list of words
for i, phrase in enumerate(lis):
    words = phrase.split(' ')  # get list of words
    kept = []
    for word in words:
        if word not in stopwords:  # linear scan of the stopword list
            kept.append(word)
    lis[i] = ' '.join(kept)  # rebuild the phrase without its stop words

This is my current method. But this means I have to go through every phrase in the list and check each of its words against the stopwords. Is there a way to process these phrases with only a single comparison pass?

Thanks.

  • How long is "long"? If it's less than 100,000 elements, I wouldn't worry about it. Especially if you make stopwords into a set, as x in set checking is very fast. Commented Dec 5, 2014 at 16:26
  • A nested list comprehension would maybe be nicer (or more confusing?) to look at, but this is pretty much the best way I can see to do this. Commented Dec 5, 2014 at 16:28
  • @Kevin Well, it's 100,000 long, but I still don't want to check multiple times. Commented Dec 5, 2014 at 16:29
  • You have to check each phrase either way, and as Kevin said, using a set would make lookups O(1). Commented Dec 5, 2014 at 16:30
  • Some complexity comparisons show that checking for x in stopwords is linear in time if stopwords is a list and constant in time if it is a set (as Kevin said). In other words, with a set, you (almost) wouldn't feel the difference between a small one and a huge one (it's fast in both cases). Commented Dec 5, 2014 at 16:36
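
To make the list-versus-set point from the comments concrete, here is a rough timing sketch. The stopword list is a made-up stand-in of 100,000 tokens (an assumption, not the asker's data), and the absolute numbers will vary by machine; only the gap between the two lookups is of interest.

```python
import timeit

# Hypothetical stand-in for a long stopword list (100,000 made-up tokens).
stopword_list = ["w%d" % i for i in range(100_000)]
stopword_set = set(stopword_list)

# A word that is NOT a stop word is the worst case for the list:
# every element has to be compared before the lookup can fail.
probe = "day"

list_seconds = timeit.timeit(lambda: probe in stopword_list, number=100)
set_seconds = timeit.timeit(lambda: probe in stopword_set, number=100)

print("list lookup: %.6f s for 100 checks" % list_seconds)
print("set lookup:  %.6f s for 100 checks" % set_seconds)
```

Building the set costs one pass over the stopwords, after which each membership check is a hash lookup rather than a scan.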

2 Answers

3

This is the same idea, but with a few improvements. Convert your list of stopwords to a set for faster lookups. Then iterate over your phrase list in a list comprehension: split each phrase into words, keep the words that are not in the stop set, and join the phrase back together.

>>> lis = ['aka', 'this is a good day', 'a pretty dog']
>>> stopwords = ['a', 'dog']
>>> stop = set(stopwords)
>>> [' '.join(j for j in i.split(' ') if j not in stop) for i in lis]
['aka', 'this is good day', 'pretty']
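
If this is needed in more than one place, the same comprehension could be wrapped in a small helper so the set is built once per call. This is only a sketch; remove_stopwords is a hypothetical name and not part of the answer above.

```python
def remove_stopwords(phrases, stopwords):
    """Return copies of the phrases with every stop word dropped."""
    stop = set(stopwords)  # built once, then reused for every phrase
    return [
        " ".join(word for word in phrase.split(" ") if word not in stop)
        for phrase in phrases
    ]


lis = ['aka', 'this is a good day', 'a pretty dog']
cleaned = remove_stopwords(lis, ['a', 'dog'])
print(cleaned)  # ['aka', 'this is good day', 'pretty']
```

Every phrase is still visited once, but each word costs a single hash lookup instead of a scan of the stopword list.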

1

You could compute the set difference between the words of each phrase and the stop words.

>>> lis = ['aka', 'this is a good day', 'a pretty dog']
>>> stopwords = ['a', 'dog']

>>> stop = set(stopwords)
>>> result = list(map(lambda phrase: " ".join(set(phrase.split(' ')) - stop), lis))
>>> print(result)

['aka', 'this is good day', 'pretty']

1 Comment

That actually messes up the order of the words in the phrases, since you make a set out of the split. With lis = ['a b c d e f g'] it gives, for example, ['c b e d g f'].
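
A small sketch of the difference the comment describes, using the same example phrase. The stop set here reuses the two stop words from the answers; the variable names are illustrative only.

```python
stop = {'a', 'dog'}
phrase = 'a b c d e f g'

# Set difference: the right words survive, but their order is arbitrary.
unordered = set(phrase.split(' ')) - stop

# Filtering the split list keeps the words in their original order.
ordered = [word for word in phrase.split(' ') if word not in stop]

print(sorted(unordered))   # ['b', 'c', 'd', 'e', 'f', 'g']
print(' '.join(ordered))   # b c d e f g
```

The set-based version also collapses repeated words within a phrase, which the filtering version does not.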
