
I am trying to classify objects that have multiple levels. The best way I can explain it is with an example:

I can do this:

from sklearn import tree
features = [['Hip Hop','Boston'],['Metal','Cleveland'],['Gospel','Ohio'],['Grindcore','Agusta']]
labels = [1,0,0,0]
clf = tree.DecisionTreeClassifier()
clf = clf.fit(features, labels)

But I want to do this:

from sklearn import tree
features = [['Hip Hop','Boston',['Run DMC','Kanye West']],
            ['Metal','Cleveland',['Guns n roses','Poison']],
            ['Gospel','Ohio',['Christmania','I Dream of Jesus']],
            ['Grindcore','Agusta',['Pig Destroyer','Carcas','Cannibal Corpse']]]
labels = [1,0,0,0]
clf = tree.DecisionTreeClassifier()
clf = clf.fit(features, labels)
clf.predict_proba(<blah blah>)

I am trying to assign a probability that a person will enjoy a band based on their location, favorite genre, and other bands they like.

  • 1
    Are these continuous or categorical features? Seems to me like, in the case of the first element in your desired features ([1,2,[1,2,3]]), you could one-hot-encode the [1,2,3] portion, basically encoding that that row has the attribute of [1,2,3], whatever that particular combination of levels corresponds to... Commented Nov 9, 2017 at 15:54
  • It seems that each observation has some singleton attributes (like 1 or 2 by themselves), but also an interaction, like [6, 7, 3]... it would be really helpful if you could define (through a minimal example) what you're hoping to accomplish and what each of the numbers in features corresponds to... Commented Nov 9, 2017 at 15:59
  • BTW, you know that decision trees will account for interactions between variables/levels if explicitly programmed, right? Commented Nov 9, 2017 at 16:04
  • @blacksite I updated the question to be a little more specific. This is my first go at a machine learning problem if you couldn't already tell :) Commented Nov 9, 2017 at 16:20
  • A list of values in a single feature cell should be converted to one-hot encoded columns using MultiLabelBinarizer. Commented Nov 9, 2017 at 16:38

1 Answer


You have a simple solution: just turn each band into a binary feature (you can use MultiLabelBinarizer or something similar). Your X matrix just before feeding it into a tree will look like this:

(image: binary user-band matrix)

You could create such a matrix with this code:

import pandas as pd

features = [['Hip Hop','Boston',['Run DMC','Kanye West']],
            ['Metal', 'Cleveland',['Guns n roses','Poison']],
            ['Gospel','Ohio',['Christmania','I Dream of Jesus']],
            ['Grindcore','Agusta', ['Pig Destroyer', 'Carcas', 'Cannibal Corpse']]]

# one row per person: the genre, the city, and each liked band become 0/1 columns
df = pd.DataFrame([{**{f[0]: 1, f[1]: 1}, **{k: 1 for k in f[2]}} for f in features]).fillna(0)
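
Alternatively, a minimal sketch of the MultiLabelBinarizer route mentioned above, assuming the features list from the snippet; it treats genre, city, and liked bands as one bag of tags per person (you may prefer to encode genre and city separately):

from sklearn import tree
from sklearn.preprocessing import MultiLabelBinarizer

# flatten each row into a single set of "tags": genre, city, and every liked band
tags = [[f[0], f[1], *f[2]] for f in features]

mlb = MultiLabelBinarizer()
X = mlb.fit_transform(tags)   # binary matrix, one column per distinct tag
print(mlb.classes_)           # the column order of X

labels = [1, 0, 0, 0]
clf = tree.DecisionTreeClassifier().fit(X, labels)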

If the number of bands is low, binary encoding will suffice. But if there are too many bands, you might want to reduce dimensionality. You can accomplish that with the following steps (a code sketch follows the list):

  1. Create the user-bands count matrix, like above
  2. (Optionally) normalize it e.g. with tf-idf
  3. Apply a matrix decomposition algorithm to it to extract the "latent features" from the matrix.
  4. Feed the latent features to your decision tree (or any other estimator).
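
Here is a hedged sketch of steps 1-4, assuming the indicator DataFrame df built above stands in for the user-band count matrix; TfidfTransformer and TruncatedSVD are just one possible choice of normalizer and decomposition:

from sklearn import tree
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.decomposition import TruncatedSVD

# 1. user-band count matrix (here: the 0/1 indicator matrix from above)
counts = df.values

# 2. optional tf-idf-style normalization to down-weight very common bands
counts_tfidf = TfidfTransformer().fit_transform(counts)

# 3. matrix decomposition to a handful of latent features
svd = TruncatedSVD(n_components=2)   # tune n_components to your data size
latent = svd.fit_transform(counts_tfidf)

# 4. feed the latent features to the tree (or any other estimator)
labels = [1, 0, 0, 0]
clf = tree.DecisionTreeClassifier().fit(latent, labels)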

If the number of bands is large but you have too few observations, even matrix decomposition may not help much. In that case, the only advice is to use simpler features, e.g. replace the bands with their corresponding genres.
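
For illustration, a tiny hypothetical band-to-genre lookup (the band_to_genre dict below is made up) showing how the band lists could be collapsed into genre tags before encoding:

# hypothetical band -> genre lookup; replace with real metadata
band_to_genre = {'Run DMC': 'Hip Hop', 'Kanye West': 'Hip Hop',
                 'Guns n roses': 'Rock', 'Poison': 'Rock',
                 'Pig Destroyer': 'Grindcore', 'Carcas': 'Grindcore'}

# replace each liked band with its genre; unknown bands fall into 'Other'
genre_tags = [[band_to_genre.get(b, 'Other') for b in f[2]] for f in features]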


2 Comments

thank you for your reply. This makes a lot of sense to me and I'll give it a try. One more question: how many bands is too many? On the order of tens or hundreds or thousands?
It depends on the band frequencies and distribution of classes that you predict. Without such information, I would roughly estimate the 'too many' as min(10000, num_samples/10)
