39

I am looking for for a pythonic way to handle the following problem.

The pandas.get_dummies() method is great to create dummies from a categorical column of a dataframe. For example, if the column has values in ['A', 'B'], get_dummies() creates 2 dummy variables and assigns 0 or 1 accordingly.

Now, I need to handle this situation. A single column, let's call it 'label', has values like ['A', 'B', 'C', 'D', 'A*C', 'C*D'] . get_dummies() creates 6 dummies, but I only want 4 of them, so that a row could have multiple 1s.

Is there a way to handle this in a pythonic way? I could only think of some step-by-step algorithm to get it, but that would not include get_dummies(). Thanks

Edited, hope it is more clear!

3
  • could you not just select the columns you want to call get_dummies() on? like df[['A', 'B', 'C','D']].get_dummies()? Commented Sep 19, 2013 at 8:24
  • A, B, C, C*D, and so on, are the values of a single column. Commented Sep 19, 2013 at 8:46
  • OK understand, how about df[df.col.isin(['A','B','C'])].get_dummies() would this work? this would filter out the values you did not want to generate dummy values for Commented Sep 19, 2013 at 9:42

4 Answers 4

81

I know it's been a while since this question was asked, but there is (at least now there is) a one-liner that is supported by the documentation:

In [4]: df
Out[4]:
      label
0  (a, c, e)
1     (a, d)
2       (b,)
3     (d, e)

In [5]: df['label'].str.join(sep='*').str.get_dummies(sep='*')
Out[5]:
   a  b  c  d  e
0  1  0  1  0  1
1  1  0  0  1  0
2  0  1  0  0  0
3  0  0  0  1  1
Sign up to request clarification or add additional context in comments.

4 Comments

+1: You sir... are some sort of wizard! Just what I needed.. thx :)
Based on my question I think df['label'].str.get_dummies(sep='*') is enough. I guess the first part would be needed if the strings weren't already formated as the get_dummies function expects them to be
The answer is quite old but, however I am strugling to understand how is works. Could anyone comment on the structure ? I have a similar case where each row is a list of strings. The solution works like a charm but I'd like to understand why....
Oustanding solution @offbyone
7

I have a somewhat cleaner solution. Assume we want to transform the following dataframe

   pageid category
0       0        a
1       0        b
2       1        a
3       1        c

into

        a  b  c
pageid         
0       1  1  0
1       1  0  1

One way to do it is to make use of scikit-learn's DictVectorizer. I would, however, be interested in learning about other methods.

df = pd.DataFrame(dict(pageid=[0, 0, 1, 1], category=['a', 'b', 'a', 'c']))

grouped = df.groupby('pageid').category.apply(lambda lst: tuple((k, 1) for k in lst))
category_dicts = [dict(tuples) for tuples in grouped]
v = sklearn.feature_extraction.DictVectorizer(sparse=False)
X = v.fit_transform(category_dicts)

pd.DataFrame(X, columns=v.get_feature_names(), index=grouped.index)

Comments

4

You can generate the dummies dataframe with your raw data, isolate the columns that contains a given atom, and then store the result matches back to the atom column.

df
Out[28]: 
  label
0     A
1     B
2     C
3     D
4   A*C
5   C*D

dummies = pd.get_dummies(df['label'])

atom_col = [c for c in dummies.columns if '*' not in c]

for col in atom_col:
    ...:     df[col] = dummies[[c for c in dummies.columns if col in c]].sum(axis=1)
    ...:     

df
Out[32]: 
  label  A  B  C  D
0     A  1  0  0  0
1     B  0  1  0  0
2     C  0  0  1  0
3     D  0  0  0  1
4   A*C  1  0  1  0
5   C*D  0  0  1  1

3 Comments

Works like a charm! (although you need to edit 'd' into 'dummies' in the code) thanks a lot!
I just found that the solution does not work if the values of the 'label' variable are of the form [A,B,C,D,A*C,C*D,E*F]. The way I got it to work anyway was by doing something like: atom_col = '*'.join(pd.Series(df['label'])).split('*').unique() (it's ugly in 1 line but the steps are all there)
you can also use a set comprehension: atom_col = {c for raw_c in dummies.columns for c in raw_c.split('*') if '*' not in c}
4

I believe this question needs an updated answer after coming across the MultiLabelBinarizer from sklearn.

The usage of this is as simple as...

# Instantiate the binarizer
mlb = MultiLabelBinarizer()

# Using OP's original data frame
df = pd.DataFrame(data=['A', 'B', 'C', 'D', 'A*C', 'C*D'], columns=["label"])

print(df)
  label
0     A
1     B
2     C
3     D
4   A*C
5   C*D

# Convert to a list of labels
df = df.apply(lambda x: x["label"].split("*"), axis=1)

print(df)
0       [A]
1       [B]
2       [C]
3       [D]
4    [A, C]
5    [C, D]
dtype: object

# Transform to a binary array
array_out = mlb.fit_transform(df)

print(array_out)
[[1 0 0 0]
 [0 1 0 0]
 [0 0 1 0]
 [0 0 0 1]
 [1 0 1 0]
 [0 0 1 1]]

# Convert back to a dataframe (unnecessary step in many cases)
df_out = pd.DataFrame(data=array_out, columns=mlb.classes_)

print(df_out)
   A  B  C  D
0  1  0  0  0
1  0  1  0  0
2  0  0  1  0
3  0  0  0  1
4  1  0  1  0
5  0  0  1  1

This is also very fast, took virtually no time (.03 seconds) across 1000 rows and 50K classes.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.