Removing stopwords from list using python3

Question

I have been trying to remove stopwords from a CSV file that I'm reading with Python, but my code does not seem to work. I tried using sample text in the code to validate it, but the result is the same. My code is below; I would appreciate it if anyone could help me fix the issue.

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import csv

article = ['The computer code has a little bug',
           'im learning python',
           'thanks for helping me',
           'this is trouble',
           'this is a sample sentence'
           'cat in the hat']

tokenized_models = [word_tokenize(str(i)) for i in article]
stopset = set(stopwords.words('english'))
stop_models = [i for i in tokenized_models if str(i).lower() not in stopset]
print('token:'+str(stop_models))

As general advice, it's useful to simply print out the values you currently have between lines to see what's being sent to each successive line.
– Akshat Mahajan, Commented May 26, 2016 at 21:05

Iulius Curt · Accepted Answer · 2016-05-26 21:15:49Z

3

Your tokenized_models is a list of tokenized sentences, so a list of lists. Ergo, the following line compares the string form of a whole list of words against the stopword set, which never matches, so nothing is removed:

stop_models = [i for i in tokenized_models if str(i).lower() not in stopset]
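A quick way to see this is to print what the condition actually compares. This is only a sketch: the small stopset and the pre-tokenized lists are hand-written stand-ins (assumptions, not the real output of stopwords.words('english') or word_tokenize), so it runs without any NLTK data.

```python
# Hypothetical stand-ins so the snippet runs without NLTK corpora.
stopset = {"this", "is", "a", "the", "in", "for", "me"}
tokenized_models = [["this", "is", "trouble"], ["cat", "in", "the", "hat"]]

for i in tokenized_models:
    candidate = str(i).lower()
    # candidate is the text of the whole list, e.g. "['this', 'is', 'trouble']"
    print(repr(candidate), candidate in stopset)

# No candidate is ever in stopset, so every sentence is kept unchanged.
stop_models = [i for i in tokenized_models if str(i).lower() not in stopset]
print(stop_models == tokenized_models)
```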

Instead, iterate again through words. Something like:

clean_models = []
for m in tokenized_models:
    stop_m = [i for i in m if str(i).lower() not in stopset]
    clean_models.append(stop_m)

print(clean_models)
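The same loop can also be written as a nested list comprehension. A minimal sketch, again with a hand-written stopset and pre-tokenized sentences as assumptions in place of the NLTK stopword list and word_tokenize output:

```python
# Hypothetical stand-ins so the snippet runs without NLTK corpora.
stopset = {"the", "has", "a", "this", "is"}
tokenized_models = [
    ["The", "computer", "code", "has", "a", "little", "bug"],
    ["this", "is", "trouble"],
]

# Outer loop walks the sentences, inner loop filters the words of each one.
clean_models = [
    [word for word in sentence if word.lower() not in stopset]
    for sentence in tokenized_models
]
print(clean_models)
```

Since the tokens are already strings, word.lower() is enough here and the str() call can be dropped.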

Off-topic useful hint:
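To define a multi-line string, use parentheses and no commas. Adjacent string literals are joined with nothing in between, so end each piece with a space to keep the words apart:

article = ('The computer code has a little bug '
           'im learning python '
           'thanks for helping me '
           'this is trouble '
           'this is a sample sentence '
           'cat in the hat')

With a single string, tokenize it once with word_tokenize(article); the filtering line from your original code then works unchanged, because each item it tests is a single word.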
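A minimal sketch of the single-string route. It assumes each fragment ends with a space so words at the joins stay separate, and it uses str.split and a hand-written stopset as stand-ins for word_tokenize and stopwords.words('english'), so it runs without NLTK data:

```python
# Adjacent string literals inside parentheses are joined into one string.
article = ('this is trouble '
           'this is a sample sentence '
           'cat in the hat')

# Hypothetical stand-ins for word_tokenize and the NLTK stopword list.
stopset = {"this", "is", "a", "in", "the"}
tokens = article.split()

# Each item is now a single word, so the membership test behaves as intended.
filtered = [w for w in tokens if w.lower() not in stopset]
print(filtered)
```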

edited May 26, 2016 at 21:15

answered May 26, 2016 at 21:08

Iulius Curt


Alex Hall · Accepted Answer · 2016-05-26 21:09:26Z

0

word_tokenize(str(i)) returns a list of words, so tokenized_models is a list of lists. You need to flatten that list, or better yet just make article a single string, since I don't see why it's a list at the moment.

This is because the in operator won't search through a list and then through strings in that list at the same time, e.g.:

>>> 'a' in 'abc'
True
>>> 'a' in ['abc']
False
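If you want to keep article as a list, one way to flatten the list of lists is itertools.chain.from_iterable. This is a sketch under assumptions: the token lists and stopset below are hand-written stand-ins for the word_tokenize output and the NLTK stopword list.

```python
from itertools import chain

# Hypothetical stand-ins so the snippet runs without NLTK corpora.
tokenized_models = [["this", "is", "trouble"], ["cat", "in", "the", "hat"]]
stopset = {"this", "is", "in", "the"}

# Flatten the list of lists into one list of words.
flat_tokens = list(chain.from_iterable(tokenized_models))

# Each item is a word, so 'in' now checks words against the stopword set.
filtered = [w for w in flat_tokens if w.lower() not in stopset]
print(filtered)
```

Note that flattening discards the sentence boundaries; keep the nested structure if you need results per sentence.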

answered May 26, 2016 at 21:09

Alex Hall
