Split a pandas DataFrame column into OneHot/Binary columns

Question

I have a DataFrame that I'm formatting for a scikit-learn PCA. It looks something like this:

datetime |  mood |  activities |  notes

8/27/2017 |  "good" | ["friends", "party", "gaming"] | NaN

8/28/2017 |  "meh" |  ["work", "friends", "good food"] | "Stuff stuff"

8/29/2017 |  "bad" |  ["work", "travel"] |  "Fell off my bike"

...and so on

I'd like to transform it to this, which I think will be better for ML work:

datetime |  mood |  friends | party | gaming | work | good food | travel |  notes

8/27/2017 |  "good" | True | True | True | False | False | False | NaN

8/28/2017 |  "meh" |  True | False | False | True | True | False | "Stuff stuff"

8/29/2017 |  "bad" | False | False | False | True | False | True | "Fell off my bike"

I've already tried the method outlined here, which just gives me a left-justified matrix of all the activities; the columns have no meaning. If I try to pass columns to the DataFrame constructor, I get the error "26 columns passed, passed data had 9 columns". I believe that's because, even though I have 26 distinct activities, the most I've ever done in a single day is 9. Is there a way to have it fill in 0/False when an activity isn't found in a particular row? Thanks.

Before you turn it into a dataframe (or try to), what's the structure of the data? 3 lists and a list-of-lists (for activities)?
– tel, Commented Dec 18, 2018 at 2:20
Well, I just got flamed for writing a Perl script to do this (bad me! How dare I!). Okay, now censored. Anyway, it's very trivial, even from base coding: construct a hash (oops, dictionaries), loop around the keys (I'm okay with that word, right?), and place an "if" statement inside the loop.
– M__, Commented Dec 18, 2018 at 2:42
@tel it's a CSV export from a mental health app. I've had to manually do something like this before, with another similarly formatted file. It just seems like something that either pandas or sklearn might have shortcuts for.
– sawyermclane, Commented Dec 18, 2018 at 2:53

It_is_Chris · Accepted Answer · 2018-12-18 04:24:32Z

7

You can simply use get_dummies.

Let's assume this DataFrame:

import numpy as np
import pandas as pd

# sample data mirroring the question: one list of activities per day
df = pd.DataFrame({'datetime': pd.date_range('2017-08-27', '2017-08-29'),
                   'mood': ['good', 'meh', 'bad'],
                   'activities': [['friends', 'party', 'gaming'],
                                  ['work', 'friends', 'good food'],
                                  ['work', 'travel']],
                   'notes': [np.nan, 'stuff stuff', 'fell off my bike']})
df.set_index(['datetime'], inplace=True)

            mood      activities                notes
datetime            
2017-08-27  good    [friends, party, gaming]    NaN
2017-08-28  meh     [work, friends, good food]  stuff stuff
2017-08-29  bad     [work, travel]              fell off my bike

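Then just concat and get_dummies (an activity that occurs at more than one list position, such as friends, gets one column per position):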

df2 = pd.concat([df[['mood','notes']], pd.get_dummies(df['activities'].apply(pd.Series),
                                                      prefix='activity')], axis=1)


            mood  notes             activity_friends  activity_work  activity_friends  activity_party  activity_travel  activity_gaming  activity_good food
datetime
2017-08-27  good  NaN               1                 0              0                 1               0                1                0
2017-08-28  meh   stuff stuff       0                 1              1                 0               0                0                1
2017-08-29  bad   fell off my bike  0                 1              0                 0               1                0                0

You can change them to booleans if you want using loc:

df2.loc[:,df2.columns[2:]] = df2.loc[:,df2.columns[2:]].astype(bool)
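
A possible alternative, not part of the original answer: because get_dummies is applied per list position above, one activity can end up with several columns. The sketch below assumes pandas 0.25 or later (for Series.explode) and yields a single boolean column per activity. The names exploded, onehot and df_onehot are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'datetime': pd.date_range('2017-08-27', '2017-08-29'),
    'mood': ['good', 'meh', 'bad'],
    'activities': [['friends', 'party', 'gaming'],
                   ['work', 'friends', 'good food'],
                   ['work', 'travel']],
    'notes': [np.nan, 'stuff stuff', 'fell off my bike'],
}).set_index('datetime')

# One row per (date, activity) pair; the date label repeats for each activity.
exploded = df['activities'].explode()

# One-hot encode the flat Series, then collapse back to one row per date.
# max() acts as a logical OR across the rows that share a date.
onehot = (pd.get_dummies(exploded)
            .astype(int)
            .groupby(level=0)
            .max()
            .astype(bool))

# Attach the indicator columns to the columns that were left untouched.
df_onehot = pd.concat([df[['mood', 'notes']], onehot], axis=1)
print(df_onehot)
```

Collapsing with groupby(level=0) works here because the datetime index is unique per row; with a non-unique index you would probably want to reset it first.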

edited Dec 18, 2018 at 4:24

answered Dec 18, 2018 at 3:56

It_is_Chris


5 Comments

LOrD_ARaGOrN Over a year ago

I am getting an error while executing the above code: KeyError: "['mood' 'notes'] not in index"

It_is_Chris Over a year ago

What does df.columns show? Make sure it is indeed ‘mood’ and not something like ‘mood ‘

kdd Over a year ago

I think this is incorrect, for example if you look at the column "activity_friends", there should be a 1 for "good" and "meh".

It_is_Chris Over a year ago

@kdd no, it is correct; there are just two activity_friends columns. If you want them concatenated simply df2.groupby(df2.columns, axis=1).sum()

kdd Over a year ago

Oh, I'm blind... I swear I looked for that. Thanks!
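
A side note on the groupby suggestion in the comments above: as far as I know, the axis=1 argument to groupby is deprecated in newer pandas releases (2.1 onward). A minimal sketch of one way to merge the duplicate columns without it is to transpose, group on the index and transpose back. The variable names are only illustrative, and the merge is applied to the dummy columns alone, not to mood or notes.

```python
import pandas as pd

activities = pd.Series([['friends', 'party', 'gaming'],
                        ['work', 'friends', 'good food'],
                        ['work', 'travel']])

# Position-wise dummies, as in the answer: 'activity_friends' appears twice.
dummies = pd.get_dummies(activities.apply(pd.Series), prefix='activity').astype(int)

# Transpose so duplicate column names become duplicate index labels,
# merge them with an ordinary row-wise groupby, then transpose back.
merged = dummies.T.groupby(level=0).max().T.astype(bool)
print(merged)
```

Using max() instead of sum() keeps the result a plain 0/1 indicator even if an activity were listed twice on the same day.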

tel · Accepted Answer · 2018-12-18 07:28:20Z

2

Here's a complete solution, parsing of the messy output and all:

from ast import literal_eval
from io import StringIO

import numpy as np
import pandas as pd

# the raw data

d = '''datetime |  mood |  activities |  notes

8/27/2017 |  "good" | ["friends", "party", "gaming"] | NaN

8/28/2017 |  "meh" |  ["work", "friends", "good food"] | "Stuff stuff"

8/29/2017 |  "bad" |  ["work", "travel"] |  "Fell off my bike"'''

# parse the raw data (pd.compat.StringIO was removed from pandas; io.StringIO is the stdlib equivalent)
df = pd.read_csv(StringIO(d), sep=r'\s*\|\s*', engine='python')

# parse the lists of activities (which are still strings)
acts = df['activities'].apply(literal_eval)

# get the unique activities
actcols = np.unique([a for al in acts for a in al])

# assemble the desired one hot array from the activities
# (np.isin replaces np.in1d, which is deprecated as of NumPy 2.0)
actarr = np.array([np.isin(actcols, al) for al in acts])
actdf = pd.DataFrame(actarr, columns=actcols)

# stick the dataframe with the one hot array onto the main dataframe
df = pd.concat([df.drop(columns='activities'), actdf], axis=1)

# fancy print
with pd.option_context("display.max_columns", 20, 'display.width', 9999):
    print(df)

Output:

    datetime    mood               notes  friends  gaming  good food  party  travel   work
0  8/27/2017  "good"                 NaN     True    True      False   True   False  False
1  8/28/2017   "meh"       "Stuff stuff"     True   False       True  False   False   True
2  8/29/2017   "bad"  "Fell off my bike"    False   False      False  False    True   True
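
On the question of whether pandas has a shortcut for this: one candidate, offered here as a hedged sketch and not as part of the original answer, is Series.str.get_dummies. It assumes the activities are already parsed into lists and that no activity name contains the '|' delimiter. The names acts and actdf are reused from above purely for illustration.

```python
import pandas as pd

acts = pd.Series([['friends', 'party', 'gaming'],
                  ['work', 'friends', 'good food'],
                  ['work', 'travel']], name='activities')

# Join each list into one '|'-delimited string, then let pandas split it
# back into one indicator column per distinct activity.
actdf = acts.str.join('|').str.get_dummies(sep='|').astype(bool)
print(actdf)
```

The resulting frame has the same alphabetically sorted columns as the np.unique approach and can be concatenated onto the main DataFrame in the same way.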

edited Dec 18, 2018 at 7:28

answered Dec 18, 2018 at 2:55

tel


4 Comments

sawyermclane Over a year ago

This is it. I was specifically looking for something like

actrows = np.array([np.in1d(actdf.columns.values, a) for a in acts]);  actdf = pd.DataFrame(actrows, columns=actdf.columns)

user1889297 Over a year ago

For some reason I get an error in acts = df['activities'].apply(pd.eval), and I don't know what al is in the line actcols = np.unique([a for al in acts for a in al])

tel Over a year ago

Odd. You'll probably be able to fix your issue by using the builtin eval instead of pd.eval. I actually edited the answer from one to the other since pd.eval should be a little more limited/safer while still getting the job done. If I were you, I would also check the version of my pandas package. It may be that upgrading will fix the bug (the latest version is v0.23.4)

tel Over a year ago

The line that creates actcols is an example of a nested list comprehension. acts is a list-of-lists (really a pd.Series-of-lists, but whatever), al is a single sublist, and a is a single activity.
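
To make the nested comprehension concrete, here is a small sketch in plain Python that spells it out as explicit loops. The sample lists are copied from the question; flat and flat_loop are illustrative names.

```python
acts = [['friends', 'party', 'gaming'],
        ['work', 'friends', 'good food'],
        ['work', 'travel']]

# Nested comprehension: the outer loop is written first, the inner loop second.
flat = [a for al in acts for a in al]

# The same thing as explicit loops.
flat_loop = []
for al in acts:      # al is one sublist of activities
    for a in al:     # a is a single activity
        flat_loop.append(a)

assert flat == flat_loop

# Deduplicate and sort, which is what np.unique does to the flat list.
print(sorted(set(flat)))
```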
