Classifying text strings into multiple classes using Naive Bayes with NLTK

Question

I'm using NLTK's Naive Bayes classifier to classify a set of texts into multiple categories. Right now I output only the top category and its posterior probability. What I would like to do is rank all the categories by posterior probability and keep the 2nd- and 3rd-place categories as "backup" categories.

Here's an example:

import pandas

df = pandas.DataFrame({
    'text': pandas.Categorical(["I have wings", "Metal wings", "Feathers", "Airport"]),
    'true_cat': pandas.Categorical(["bird", "plane", "bird", "plane"]),
})

text           true_cat
-----------------------
I have wings   bird
Metal wings    plane
Feathers       bird
Airport        plane

What I'm doing:

new_cat = classifier.classify(features(text))
prob_cat = classifier.prob_classify(features(text))

Eventual Output:

new_cat prob_cat    text           true_cat
bird    0.67        I have wings   bird
bird    0.6         Feathers       bird
bird    0.51        Metal wings    plane
plane   0.8         Airport        plane

I have found a couple of examples using classify_many and prob_classify_many, but since I'm new to Python I'm having trouble translating them to my problem. I haven't seen them used with pandas anywhere.

I want it to look like this:

df_new = pandas.DataFrame({
    'text': pandas.Categorical(["I have wings", "Metal wings", "Feathers", "Airport"]),
    'true_cat': pandas.Categorical(["bird", "plane", "bird", "plane"]),
    'new_cat1': pandas.Categorical(["bird", "bird", "bird", "plane"]),
    'new_cat2': pandas.Categorical(["plane", "plane", "plane", "bird"]),
    'prob_cat1': pandas.Categorical(["0.67", "0.51", "0.6", "0.8"]),
    'prob_cat2': pandas.Categorical(["0.33", "0.49", "0.4", "0.2"]),
})


new_cat1    new_cat2    prob_cat1   prob_cat2   text           true_cat
-----------------------------------------------------------------------
bird        plane       0.67        0.33        I have wings   bird
bird        plane       0.51        0.49        Metal wings    plane
bird        plane       0.6         0.4         Feathers       bird
plane       bird        0.8         0.2         Airport        plane

Any help would be appreciated.

alexis · Accepted Answer · 2016-12-06 21:58:28Z

1

I'm treating your self-answer as part of your question. Presumably you got the probability of the classification bird like this:

prob_cat.prob("bird")

Here, prob_cat is an nltk probability distribution (ProbDist). You can get all categories in a discrete ProbDist and their probability like this:

probs = list((x, prob_cat.prob(x)) for x in prob_cat.samples())

Since you already know the categories you trained with, you can use a predefined list instead of prob_cat.samples(). Finally, you can order them from the most to the least probable in the same expression:

mycategories = ["bird", "plane"]
probs = sorted(((x, prob_cat.prob(x)) for x in mycategories), key=lambda tup: -tup[1])
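
As a minimal sketch of the ranking step, the snippet below wraps the expression above in a helper. The names rank_categories and StubProbDist are hypothetical and not part of nltk. The stub only mimics the samples() and prob() methods of an nltk ProbDist so that the example runs without a trained classifier. With a real classifier you would pass the result of classifier.prob_classify(features(text)) instead.

```python
class StubProbDist:
    """Hypothetical stand-in for an nltk ProbDist (only samples() and prob())."""

    def __init__(self, probs):
        self._probs = dict(probs)

    def samples(self):
        # All categories the distribution knows about.
        return list(self._probs)

    def prob(self, sample):
        # Probability of one category; 0.0 for unknown categories.
        return self._probs.get(sample, 0.0)


def rank_categories(prob_dist):
    """Return (category, probability) pairs, most probable first."""
    pairs = [(label, prob_dist.prob(label)) for label in prob_dist.samples()]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# In real use: prob_cat = classifier.prob_classify(features(text))
prob_cat = StubProbDist({"plane": 0.33, "bird": 0.67})
ranked = rank_categories(prob_cat)
print(ranked)
```

The first element of the result is the top category, and the following elements are the "backup" categories in order.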


J Sedai · Accepted Answer · 2016-12-05 20:35:22Z

0

I'm starting to get there now.

# This gives me the probability it's a bird.
prob_cat.prob("bird")

# This gives me the probability it's a plane.
prob_cat.prob("plane")

Now since I have dozens of categories I'm working on a good way to have it give me all of them without putting in all of the category names, but that should be pretty simple.
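
One possible way to do this without typing the category names is sketched below. It assumes the probabilities have already been collected into a plain dict, for example with {label: prob_cat.prob(label) for label in prob_cat.samples()}. The helper name top_k_row and the new_catN / prob_catN column naming are assumptions chosen to match the desired output in the question.

```python
def top_k_row(probs, k=2):
    """Flatten the k most probable categories into one row of columns.

    probs maps category name -> probability (assumed input format).
    """
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    row = {}
    for i, (label, prob) in enumerate(ranked[:k], start=1):
        row[f"new_cat{i}"] = label
        row[f"prob_cat{i}"] = prob
    return row


# Hypothetical probabilities for the text "Metal wings".
probs = {"plane": 0.49, "bird": 0.51}
row = top_k_row(probs, k=2)
print(row)
```

Building one such row per text gives a list of dicts, which pandas.DataFrame accepts directly, so the resulting columns can then be joined to the original frame.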
