0

I have a text file with one sentence per line, e.g. "Have you registered your email ID with your Bank Account?"

I want to classify each sentence as interrogative or not. FYI, these are sentences from bank websites. I've seen this answer with this nltk code block:

import nltk
nltk.download('nps_chat')
posts = nltk.corpus.nps_chat.xml_posts()[:10000]


def dialogue_act_features(post):
    features = {}
    for word in nltk.word_tokenize(post):
        features['contains({})'.format(word.lower())] = True
    return features

featuresets = [(dialogue_act_features(post.text), post.get('class')) for post in posts]
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

So I preprocessed my text file (stemming, stop-word removal, etc.) to turn each sentence into a bag of words. The code above gives me a trained classifier. How do I apply it to my text file of sentences (either raw or preprocessed)?

Update: here is an example of my text file.

  • You need to convert the documents using (scikit-learn.org/stable/modules/generated/…) and then use the classifier. Can you upload your data? Commented May 29, 2018 at 8:37
  • @seralouk thank you for your response, I will look at the link now! I have updated the question with an example of my data. Commented May 29, 2018 at 9:09
  • not sure why I'm being downvoted, is there any more information I should be providing? Commented May 29, 2018 at 9:10
  • @seralouk no they are all strings of sentences. I have given the preprocessed version. If you want I can attach the processed version where numbers are taken out, words are stemmed, and stopwords are removed? Commented May 29, 2018 at 9:12
  • @seralouk can't I train the classifier using nps_chat and get the sample data from that? Commented May 29, 2018 at 9:13

2 Answers

1

Assuming that you have preprocessed the document data as we discussed, you can do the following:

import nltk
nltk.download('nps_chat')
posts = nltk.corpus.nps_chat.xml_posts()[:10000]


def dialogue_act_features(post):
    features = {}
    for word in nltk.word_tokenize(post):
        features['contains({})'.format(word.lower())] = True
    return features

featuresets = [(dialogue_act_features(post.text), post.get('class')) for post in posts]
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]

classifier = nltk.NaiveBayesClassifier.train(train_set)  # train on train_set only, so test_set stays unseen
print(nltk.classify.accuracy(classifier, test_set))

0.668

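For your data, iterate over the lines of your file and classify each one with the classifier trained above:

# `lines` is your list of sentences; the classifier is already trained, so it is not re-fitted here
for line in lines:
    print(classifier.classify(dialogue_act_features(line)))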
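
As a rough sketch of how the per-line prediction could be turned into an interrogative / not-interrogative decision: the nps_chat corpus tags questions as `whQuestion` and `ynQuestion`, so one option is to treat those two labels as "interrogative" and everything else as "not". The helper names `label_sentences` and `label_file` below are made up for illustration, and the feature function and classifier are passed in as arguments rather than assumed.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Dialogue-act labels in the nps_chat corpus that mark questions.
QUESTION_LABELS = {"whQuestion", "ynQuestion"}


def label_sentences(
    sentences: Iterable[str],
    featurize: Callable[[str], Dict[str, bool]],
    classify: Callable[[Dict[str, bool]], str],
) -> List[Tuple[str, str, bool]]:
    """Return (sentence, dialogue_act, is_interrogative) for each non-blank sentence."""
    results = []
    for raw in sentences:
        sentence = raw.strip()
        if not sentence:
            continue  # skip blank lines
        act = classify(featurize(sentence))
        results.append((sentence, act, act in QUESTION_LABELS))
    return results


def label_file(path, featurize, classify, encoding="utf-8"):
    """Read one sentence per line from `path` (hypothetical file) and label each one."""
    with open(path, encoding=encoding) as handle:
        return label_sentences(handle, featurize, classify)
```

With the objects defined in the code above, a call would look like `label_file("sentences.txt", dialogue_act_features, classifier.classify)`, where `sentences.txt` is only a placeholder for your own file name.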

0

Doing this for every line in the text file works:

# `lines` is the list of sentences read from the text file; the classifier is trained once, beforehand
for line in lines:
    print(classifier.classify(dialogue_act_features(line)))
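
On the raw-versus-preprocessed point from the question: the classifier was trained on lowercased raw tokens, so its features only match input that is tokenized the same way. The sketch below is an illustration only. It uses a regex tokenizer as a stand-in for `nltk.word_tokenize` and a tiny made-up stop-word list, and shows which `contains(...)` features disappear from the example sentence once stop words and punctuation are removed. Several of the lost features (the question mark, "have", "you") look like plausible question cues, which suggests feeding the raw sentences may be the safer choice.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"have", "you", "your", "with", "do", "is", "the", "a"}


def tokenize(text):
    """Stand-in for nltk.word_tokenize: split into words and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def features(tokens):
    """Same feature scheme as dialogue_act_features, applied to a token list."""
    return {"contains({})".format(token.lower()): True for token in tokens}


def strip_stop_words_and_punct(tokens):
    """Drop stop words and punctuation, as a typical preprocessing step would."""
    return [t for t in tokens if t.lower() not in STOP_WORDS and t.isalnum()]


sentence = "Have you registered your email ID with your Bank Account?"
raw = features(tokenize(sentence))
processed = features(strip_stop_words_and_punct(tokenize(sentence)))

# Features present for the raw sentence but missing after preprocessing.
lost = sorted(set(raw) - set(processed))
print(lost)
```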
