
I am trying to classify objects that have multiple levels. The best way I can explain it is with an example:

I can do this:

from sklearn import tree
features = [['Hip Hop','Boston'],['Metal','Cleveland'],['Gospel','Ohio'],['Grindcore','Agusta']]
labels = [1,0,0,0]
clf = tree.DecisionTreeClassifier()
clf = clf.fit(features, labels)

But I want to do this:

from sklearn import tree
features = [['Hip Hop','Boston',['Run DMC','Kanye West']],
            ['Metal','Cleveland',['Guns n roses','Poison']],
            ['Gospel','Ohio',['Christmania','I Dream of Jesus']],
            ['Grindcore','Agusta',['Pig Destroyer','Carcas','Cannibal Corpse']]]
labels = [1,0,0,0]
clf = tree.DecisionTreeClassifier()
clf = clf.fit(features, labels)
clf.predict_proba(<blah blah>)

I am trying to assign a probability that a person will enjoy a band based on their location, favorite genre, and other bands they like.

  • 1
    Are these continuous or categorical features? Seems to me like, in the case of the first element in your desired features ([1,2,[1,2,3]]), you could one-hot-encode the [1,2,3] portion, basically encoding that that row has the attribute of [1,2,3], whatever that particular combination of levels corresponds to... Commented Nov 9, 2017 at 15:54
  • It seems that each observation has some singleton attributes (like 1 or 2 by themselves), but also an interaction, like [6, 7, 3]... it would be really helpful if you could define (through a minimal example) what you're hoping to accomplish and what each of the numbers in features corresponds to... Commented Nov 9, 2017 at 15:59
  • BTW, you know that decision trees will account for interactions between variables/levels if explicitly programmed, right? Commented Nov 9, 2017 at 16:04
  • @blacksite I updated the question to be a little more specific. This is my first go at a machine learning problem if you couldn't already tell :) Commented Nov 9, 2017 at 16:20
  • A list of values in a single feature cell should be converted to one-hot encoded columns using MultiLabelBinarizer. Commented Nov 9, 2017 at 16:38

1 Answer


You have a simple solution: just turn each band into a binary feature (you can use MultiLabelBinarizer or something similar). Your X matrix just before feeding it into a tree will look like this:

(image: binary user-band matrix)

You could create such a matrix with this code:

import pandas as pd

features = [['Hip Hop','Boston',['Run DMC','Kanye West']],
            ['Metal', 'Cleveland',['Guns n roses','Poison']],
            ['Gospel','Ohio',['Christmania','I Dream of Jesus']],
            ['Grindcore','Agusta', ['Pig Destroyer', 'Carcas', 'Cannibal Corpse']]]

# one row per person: the genre, the city, and each liked band become 0/1 columns
df = pd.DataFrame([{**{f[0]: 1, f[1]: 1}, **{k: 1 for k in f[2]}} for f in features]).fillna(0)
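
Alternatively, a minimal sketch of the MultiLabelBinarizer route mentioned above, assuming the features list from the snippet; it treats genre, city, and liked bands as one bag of tags per person (you may prefer to encode genre and city separately):

from sklearn import tree
from sklearn.preprocessing import MultiLabelBinarizer

# flatten each row into a single set of "tags": genre, city, and every liked band
tags = [[f[0], f[1], *f[2]] for f in features]

mlb = MultiLabelBinarizer()
X = mlb.fit_transform(tags)   # binary matrix, one column per distinct tag
print(mlb.classes_)           # the column order of X

labels = [1, 0, 0, 0]
clf = tree.DecisionTreeClassifier().fit(X, labels)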

If the number of bands is low, binary encoding will suffice. But if there are too many bands, you might want to reduce dimensionality. You can accomplish that with the following steps (a code sketch follows the list):

  1. Create the user-bands count matrix, like above
  2. (Optionally) normalize it e.g. with tf-idf
  3. Apply a matrix decomposition algorithm to it to extract the "latent features" from the matrix.
  4. Feed the latent features to your decision tree (or any other estimator).
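
Here is a hedged sketch of steps 1-4, assuming the indicator DataFrame df built above stands in for the user-band count matrix; TfidfTransformer and TruncatedSVD are just one possible choice of normalizer and decomposition:

from sklearn import tree
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.decomposition import TruncatedSVD

# 1. user-band count matrix (here: the 0/1 indicator matrix from above)
counts = df.values

# 2. optional tf-idf-style normalization to down-weight very common bands
counts_tfidf = TfidfTransformer().fit_transform(counts)

# 3. matrix decomposition to a handful of latent features
svd = TruncatedSVD(n_components=2)   # tune n_components to your data size
latent = svd.fit_transform(counts_tfidf)

# 4. feed the latent features to the tree (or any other estimator)
labels = [1, 0, 0, 0]
clf = tree.DecisionTreeClassifier().fit(latent, labels)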

If the number of bands is large but you have too few observations, even matrix decomposition may not help much. In that case, the only advice is to use simpler features, e.g. replace the bands with their corresponding genres.
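
For illustration, a tiny hypothetical band-to-genre lookup (the band_to_genre dict below is made up) showing how the band lists could be collapsed into genre tags before encoding:

# hypothetical band -> genre lookup; replace with real metadata
band_to_genre = {'Run DMC': 'Hip Hop', 'Kanye West': 'Hip Hop',
                 'Guns n roses': 'Rock', 'Poison': 'Rock',
                 'Pig Destroyer': 'Grindcore', 'Carcas': 'Grindcore'}

# replace each liked band with its genre; unknown bands fall into 'Other'
genre_tags = [[band_to_genre.get(b, 'Other') for b in f[2]] for f in features]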


2 Comments

thank you for your reply. This makes a lot of sense to me and I'll give it a try. One more question: how many bands is too many? On the order of tens or hundreds or thousands?
It depends on the band frequencies and distribution of classes that you predict. Without such information, I would roughly estimate the 'too many' as min(10000, num_samples/10)
