0

I'm currently using Naive Bayes to classify a set of texts into multiple categories. Right now I output only the top category and its posterior probability, but I would like to rank all the categories by posterior probability and use the 2nd- and 3rd-place categories as "backup" categories.

Here's an example:

import pandas
df = pandas.DataFrame({'text': pandas.Categorical(["I have wings", "Metal wings", "Feathers", "Airport"]), 'true_cat': pandas.Categorical(["bird", "plane", "bird", "plane"])})

text           true_cat
-----------------------
I have wings   bird
Metal wings    plane
Feathers       bird
Airport        plane

What I'm doing:

new_cat = classifier.classify(features(text))
prob_cat = classifier.prob_classify(features(text))

Eventual Output:

new_cat prob_cat    text           true_cat
bird    0.67        I have wings   bird
bird    0.6         Feathers       bird
bird    0.51        Metal wings    plane
plane   0.8         Airport        plane

I have found a couple of examples using classify_many and prob_classify_many, but since I'm new to Python I'm having trouble translating them to my problem. I haven't seen them used with pandas anywhere.

I want it to look like this:

df_new = pandas.DataFrame({
    'text': pandas.Categorical(["I have wings", "Metal wings", "Feathers", "Airport"]),
    'true_cat': pandas.Categorical(["bird", "plane", "bird", "plane"]),
    'new_cat1': pandas.Categorical(["bird", "bird", "bird", "plane"]),
    'new_cat2': pandas.Categorical(["plane", "plane", "plane", "bird"]),
    'prob_cat1': [0.67, 0.51, 0.6, 0.8],  # floats, so the columns stay numeric
    'prob_cat2': [0.33, 0.49, 0.4, 0.2],
})


new_cat1    new_cat2    prob_cat1   prob_cat2   text           true_cat
-----------------------------------------------------------------------
bird        plane       0.67        0.33        I have wings   bird
bird        plane       0.51        0.49        Metal wings    plane
bird        plane       0.6         0.4         Feathers       bird
plane       bird        0.8         0.2         Airport        plane

Any help would be appreciated.

2 Answers

1

I'm treating your self-answer as part of your question. Presumably you got the probability of the classification bird like this:

prob_cat.prob("bird")

Here, prob_cat is an nltk probability distribution (ProbDist). You can get all categories in a discrete ProbDist and their probability like this:

probs = list((x, prob_cat.prob(x)) for x in prob_cat.samples())

Since you already know the categories you trained with, you can use a predefined list instead of prob_cat.samples(). Finally, you can order them from the most to the least probable in the same expression:

mycategories = ["bird", "plane"]
probs = sorted(((x, prob_cat.prob(x)) for x in mycategories), key=lambda tup: -tup[1])
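
The ranking step above can be wrapped in a small helper. This is only a sketch: `rank_categories` and `StubProbDist` are names made up for illustration, and the stub merely imitates the two methods used in this answer (`samples()` and `prob()`) so the example runs without a trained classifier. With a real classifier you would pass in the object returned by `classifier.prob_classify(...)` instead.

```python
class StubProbDist:
    """Hypothetical stand-in exposing only samples() and prob(),
    the two ProbDist methods this answer relies on."""

    def __init__(self, probs):
        self._probs = dict(probs)

    def samples(self):
        return self._probs.keys()

    def prob(self, sample):
        return self._probs.get(sample, 0.0)


def rank_categories(prob_dist, top_n=None):
    """Return (category, probability) pairs, most probable first."""
    pairs = [(label, prob_dist.prob(label)) for label in prob_dist.samples()]
    pairs.sort(key=lambda pair: pair[1], reverse=True)
    return pairs if top_n is None else pairs[:top_n]


# Probabilities taken from the "Metal wings" row of the question.
prob_cat = StubProbDist({"bird": 0.51, "plane": 0.49})
ranked = rank_categories(prob_cat)
print(ranked)
```

Here `ranked[0]` is the best guess and `ranked[1]` the first backup category; `top_n` trims the list when there are dozens of categories.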

0

I'm starting to get there now.

# This gives me the probability it's a bird.
prob_cat.prob("bird")

# This gives me the probability it's a plane.
prob_cat.prob("plane")

Since I have dozens of categories, I'm now working on a good way to get all of them without typing in every category name, but that should be pretty simple.
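
One possible way to get from per-category probabilities to the new_cat1 / prob_cat1 layout in the question is to build one flat row per text. This is a rough sketch, not tested against a trained classifier: `ranked_row` is a made-up helper, and the probability dicts below are copied from the question's desired output. In practice each dict could presumably be built with `{label: prob_cat.prob(label) for label in prob_cat.samples()}`.

```python
def ranked_row(text, true_cat, label_probs, top_n=2):
    """Build one flat row with new_cat1, prob_cat1, new_cat2, prob_cat2, ..."""
    ranked = sorted(label_probs.items(), key=lambda item: item[1], reverse=True)
    row = {"text": text, "true_cat": true_cat}
    for rank, (label, prob) in enumerate(ranked[:top_n], start=1):
        row[f"new_cat{rank}"] = label
        row[f"prob_cat{rank}"] = prob
    return row


# Assumed inputs: (text, true category, probability per category).
examples = [
    ("I have wings", "bird", {"bird": 0.67, "plane": 0.33}),
    ("Metal wings", "plane", {"bird": 0.51, "plane": 0.49}),
    ("Feathers", "bird", {"bird": 0.6, "plane": 0.4}),
    ("Airport", "plane", {"bird": 0.2, "plane": 0.8}),
]

rows = [ranked_row(text, true_cat, probs) for text, true_cat, probs in examples]
for row in rows:
    print(row)
```

Because `rows` is a list of dicts with the same keys, `pandas.DataFrame(rows)` should turn it into a frame with one column per key. Raising `top_n` adds new_cat3 / prob_cat3 and so on for more backup categories.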
