I have been trying to remove stopwords from a CSV file that I'm reading with Python, but my code does not seem to work. I tried a sample text in place of the file to validate the code, but the result is the same. My code is below; I would appreciate any help fixing it.

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import csv

article = ['The computer code has a little bug',
           'im learning python',
           'thanks for helping me',
           'this is trouble',
           'this is a sample sentence',
           'cat in the hat']

tokenized_models = [word_tokenize(str(i)) for i in article]
stopset = set(stopwords.words('english'))
stop_models = [i for i in tokenized_models if str(i).lower() not in stopset]
print('token:'+str(stop_models))
  • As general advice, it's useful to print out the values you currently have between lines to see what's being passed to each successive line. Commented May 26, 2016 at 21:05
  • Thanks, I have tried that without any luck. Commented May 26, 2016 at 21:09

2 Answers


Your tokenized_models is a list of tokenized sentences, so a list of lists. Ergo, the following line tries to match a list of words to a stopword:

stop_models = [i for i in tokenized_models if str(i).lower() not in stopset]
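To see why nothing gets filtered, it can help to look at what str(i) produces when i is a whole token list. This is a minimal sketch: the small stopset and the pre-tokenized sentences are assumptions standing in for stopwords.words('english') and the word_tokenize output, so that it runs without the NLTK data.

```python
# Hypothetical stand-in for set(stopwords.words('english')).
stopset = {"this", "is", "a", "the", "in", "for", "me", "has"}

# Roughly what word_tokenize gives for two of the sample sentences.
tokenized_models = [["this", "is", "trouble"], ["cat", "in", "the", "hat"]]

for tokens in tokenized_models:
    as_text = str(tokens).lower()
    # as_text is the printed form of the whole list,
    # e.g. "['this', 'is', 'trouble']", never a single word.
    print(repr(as_text), as_text in stopset)

# The original filter therefore keeps every sentence untouched.
unfiltered = [i for i in tokenized_models if str(i).lower() not in stopset]
```

Each membership test compares one long string against single-word entries, so it can never match.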

Instead, iterate over the words inside each tokenized sentence. Something like:

clean_models = []
for m in tokenized_models:
    stop_m = [i for i in m if str(i).lower() not in stopset]
    clean_models.append(stop_m)

print(clean_models)
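The same loop can also be written as a nested list comprehension. The sketch below assumes a small hand-written stopset and pre-tokenized input (both hypothetical, in place of the NLTK calls) so it is runnable on its own.

```python
# Hypothetical stand-in for set(stopwords.words('english')).
stopset = {"the", "has", "a", "this", "is"}

# Roughly what word_tokenize gives for two of the sample sentences.
tokenized_models = [
    ["The", "computer", "code", "has", "a", "little", "bug"],
    ["this", "is", "trouble"],
]

# Outer loop: one sentence at a time. Inner loop: one word at a time.
clean_models = [
    [word for word in tokens if word.lower() not in stopset]
    for tokens in tokenized_models
]

print(clean_models)
# [['computer', 'code', 'little', 'bug'], ['trouble']]
```

The result keeps one inner list per sentence, which is probably what you want if each CSV row should stay separate.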

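Off-topic useful hint:
To define a single string across several lines, use parentheses and no commas (note the trailing spaces, since adjacent literals are joined as-is):

article = ('The computer code has a little bug '
           'im learning python '
           'thanks for helping me '
           'this is trouble '
           'this is a sample sentence '
           'cat in the hat')

With this version you tokenize once, word_tokenize(article), and get a flat list of words, so a filter like the one in your original code then compares individual words against the stopset.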
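A quick check of how Python joins adjacent string literals, and what a single-string version gives you. In this sketch str.split() is only a rough stand-in for word_tokenize, and the stopset is a hypothetical hand-written one.

```python
# Adjacent literals are concatenated with nothing in between.
no_space = ('this is trouble'
            'cat in the hat')
with_space = ('this is trouble '
              'cat in the hat')

print(no_space)    # this is troublecat in the hat
print(with_space)  # this is trouble cat in the hat

# Hypothetical stand-in for set(stopwords.words('english')).
stopset = {"this", "is", "in", "the"}

# str.split() stands in for word_tokenize here; both return a flat list.
tokens = with_space.split()
kept = [word for word in tokens if word.lower() not in stopset]
print(kept)  # ['trouble', 'cat', 'hat']
```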


word_tokenize(str(i)) returns a list of words, so tokenized_models is a list of lists. You need to flatten that list, or better yet just make article a single string, since I don't see why it's a list at the moment.

This is because the in operator won't search through a list and then through strings in that list at the same time, e.g.:

>>> 'a' in 'abc'
True
>>> 'a' in ['abc']
False
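Flattening could look like the sketch below, which uses itertools.chain.from_iterable from the standard library. The stopset and the pre-tokenized sentences are assumptions standing in for the NLTK calls.

```python
from itertools import chain

# Hypothetical stand-in for set(stopwords.words('english')).
stopset = {"this", "is", "in", "the"}

# Roughly what tokenized_models looks like: one inner list per sentence.
tokenized_models = [["this", "is", "trouble"], ["cat", "in", "the", "hat"]]

# Flatten the list of lists into one list of words.
flat_tokens = list(chain.from_iterable(tokenized_models))

# Now each item is a single word, so the membership test is meaningful.
kept = [word for word in flat_tokens if word.lower() not in stopset]
print(kept)  # ['trouble', 'cat', 'hat']
```

Flattening loses the sentence boundaries, so it fits best when you only need the overall word list.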
