11

I have a set of Book objects. The class Book is defined as follows:

class Book {

    String title;
    ArrayList<Tag> taglist;

}

Here title is the title of the book, for example: Javascript for dummies.

taglist is the list of tags for the book; for our example: Javascript, jquery, "web dev", ...

As I said, I have a set of books about different subjects: IT, BIOLOGY, HISTORY, ... Each book has a title and a set of tags describing it.

I have to automatically classify those books into separate sets by topic, for example:

IT BOOKS :

  • Java for dummies
  • Javascript for dummies
  • Learn flash in 30 days
  • C++ programming

HISTORY BOOKS :

  • World wars
  • America in 1960
  • Martin luther king's life

BIOLOGY BOOKS :

  • ....

Do you know of a classification algorithm or method that applies to this kind of problem?

One solution is to use an external API to determine the category of the text, but the problem here is that the books are in different languages: French, Spanish, English, ...

8
  • Yes, but there are some common tags between the books :( Commented May 12, 2010 at 19:03
  • Related question: stackoverflow.com/questions/2781752/… Commented May 12, 2010 at 19:26
  • Sigh..the answers for this question seem to be all over the place with some of them mistakenly (?) treating it as a simple homework question on basic data structures. Yox, could you confirm that this is a text-classification task where you want to take books tagged with keywords and then use some sort of classification algorithm to map the books to the appropriate topics? Commented May 13, 2010 at 0:58
  • Have you already been given the category for each book (I don't mean tags)? Commented May 13, 2010 at 5:43
  • @dmcer: this is exactly what I want to do. The data is stored in a relational DB and I'm creating Book objects from it. @adi92: No, I don't have the category title; I have to guess/generate it automatically. Commented May 13, 2010 at 13:18

4 Answers

29

This looks like a reasonably straightforward keyword-based classification task. Since you're using Java, good packages to consider for this would be Classifier4J, Weka, or Lucene Mahout.

Classifier4J

Classifier4J supports classification using naive Bayes and a vector space model.

As seen in this source code snippet on training and scoring using its naive Bayes classifier, the package is reasonably easy to use. It's also distributed under the liberal Apache Software License.
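To make the naive Bayes idea concrete, here is a minimal, library-independent sketch in plain Java. It is not Classifier4J's API: the class name TagNaiveBayes and its train/classify methods are hypothetical, and it assumes you can hand-label a few books per topic as training data, which the question does not provide.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: multinomial naive Bayes over tag lists with add-one smoothing. */
class TagNaiveBayes {
    private final Map<String, Integer> booksPerTopic = new HashMap<>();
    private final Map<String, Map<String, Integer>> tagCounts = new HashMap<>();
    private final Map<String, Integer> tagsPerTopic = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalBooks = 0;

    /** Record one hand-labeled book: its topic and its tags. */
    void train(String topic, List<String> tags) {
        totalBooks++;
        booksPerTopic.merge(topic, 1, Integer::sum);
        Map<String, Integer> counts = tagCounts.computeIfAbsent(topic, k -> new HashMap<>());
        for (String raw : tags) {
            String tag = raw.toLowerCase(Locale.ROOT);
            counts.merge(tag, 1, Integer::sum);
            tagsPerTopic.merge(topic, 1, Integer::sum);
            vocabulary.add(tag);
        }
    }

    /** Return the topic with the highest log posterior, or null if nothing was trained. */
    String classify(List<String> tags) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String topic : booksPerTopic.keySet()) {
            // log prior: fraction of training books carrying this topic
            double score = Math.log((double) booksPerTopic.get(topic) / totalBooks);
            Map<String, Integer> counts = tagCounts.get(topic);
            int total = tagsPerTopic.getOrDefault(topic, 0);
            for (String raw : tags) {
                int c = counts.getOrDefault(raw.toLowerCase(Locale.ROOT), 0);
                // add-one smoothing keeps unseen tags from zeroing out a topic
                score += Math.log((c + 1.0) / (total + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = topic;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        TagNaiveBayes nb = new TagNaiveBayes();
        nb.train("IT", Arrays.asList("Javascript", "jquery", "web dev"));
        nb.train("HISTORY", Arrays.asList("war", "America", "1960"));
        System.out.println(nb.classify(Arrays.asList("jquery", "web dev")));
    }
}
```

Because tags are compared as plain strings, tags in French or Spanish only help if the labeled examples include books tagged in those languages.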

Weka

Weka is a very popular tool for data mining. An advantage of using it is that you'd be able to readily experiment with using numerous different machine learning models to categorize the books into topics including naive Bayes, decision trees, support vector machines, k-nearest neighbor, logistic regression, and even a rule set based learner.

You'll find a tutorial on using Weka for text categorization here.

Weka is, however, distributed under the GPL. You won't be able to use it for closed source software that you want to distribute. But, you could still use it to back a web service.

Lucene Mahout

Mahout is designed for doing machine learning on very large datasets. It's built on top of Apache Hadoop and supports supervised classification using naive Bayes.

You'll find a tutorial covering how to use Mahout for text classification here.

Like Classifier4J, Mahout is distributed under the liberal Apache Software License.


1 Comment

Used Classifier4J; VectorClassifier worked best for me.
1

Do you not want something as simple as this?

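import java.util.*;

Map<Tag, List<Book>> m = new HashMap<>();
for (Book b : books) {
    for (Tag t : b.taglist) {
        // create the list the first time a tag is seen, then add the book
        m.computeIfAbsent(t, k -> new ArrayList<>()).add(b);
    }
}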

Now m.get("IT") will return all IT books, etc...

Sure some books will appear in multiple categories, but that happens in real life, too...

3 Comments

No, the tags are more like a list of significant words in the book; they will help with grouping books.
@yox: Ah, so you want to classify the topic based on the set of tags the book has? Or based on the book text? And the tags / book text can be in different languages?
Exactly, I want to classify using only the tags, and the tags are in different languages.
1

So you are looking to make a Map of Tags that holds a Collection of Books?

EDIT:

Sounds like you might want to take a look at a Vector Space Model to apply classification of categories.

Either Lucene or Classifier4J offers a framework for this.
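As a rough illustration of the vector space idea, the sketch below turns each tag list into a sparse term-frequency vector and picks the topic whose reference tags have the highest cosine similarity. It uses neither Lucene nor Classifier4J; the class TagVectorSpace, its methods, and the per-topic reference tag lists are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch: cosine similarity between sparse tag vectors. */
class TagVectorSpace {

    /** Turn a tag list into a sparse term-frequency vector. */
    static Map<String, Double> toVector(List<String> tags) {
        Map<String, Double> v = new HashMap<>();
        for (String tag : tags) {
            v.merge(tag.toLowerCase(Locale.ROOT), 1.0, Double::sum);
        }
        return v;
    }

    /** Euclidean length of a sparse vector. */
    static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double x : v.values()) {
            sum += x * x;
        }
        return Math.sqrt(sum);
    }

    /** Cosine similarity; 0.0 when either vector is empty. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
        }
        double norms = norm(a) * norm(b);
        return norms == 0.0 ? 0.0 : dot / norms;
    }

    /** Pick the topic whose reference tags are most similar to the book's tags. */
    static String closestTopic(List<String> bookTags, Map<String, List<String>> topicTags) {
        Map<String, Double> book = toVector(bookTags);
        String best = null;
        double bestSim = -1.0;
        for (Map.Entry<String, List<String>> e : topicTags.entrySet()) {
            double sim = cosine(book, toVector(e.getValue()));
            if (sim > bestSim) {
                bestSim = sim;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, List<String>> topics = new HashMap<>();
        topics.put("IT", Arrays.asList("javascript", "java", "programming", "web dev"));
        topics.put("HISTORY", Arrays.asList("war", "america", "1960"));
        System.out.println(closestTopic(Arrays.asList("Javascript", "jquery", "web dev"), topics));
    }
}
```

When a book shares no tag with any topic, every similarity is 0 and the returned topic is arbitrary, so a real implementation would likely want a minimum-similarity threshold.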

4 Comments

I'm looking to make a map of Book objects where the key is the category name.
@yox: Make that map. That's your answer.
I don't want books by tag. I want books by topic (the map key), which is not present here; it will be an artificially generated string.
@yox: Sorry, I misunderstood.
-1

You might want to look up fuzzy matching algorithms such as Soundex and Levenshtein.
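For reference, Levenshtein distance counts the single-character insertions, deletions, and substitutions needed to turn one string into another. A minimal sketch follows; the class name is hypothetical, and using it to merge near-identical tag spellings (for example "javascript" and "java script") before classification is an assumption about how this suggestion might be applied, since the distance alone does not map tags to topics.

```java
/** Hypothetical sketch: Levenshtein edit distance using two rolling rows. */
class Levenshtein {

    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // distance from the empty prefix of a to each prefix of b
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            // reuse the two arrays instead of allocating a full matrix
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("javascript", "java script"));
        System.out.println(distance("programing", "programming"));
    }
}
```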

2 Comments

Cool, this is a great way to calculate distance between 2 strings, thank you
Could you elaborate on how you would use Soundex and Levenshtein to map lists of keywords to topics?