1

I have a DataFrame I'm formatting for an SciKit Learn PCA looks something like this:

datetime |  mood |  activities |  notes

8/27/2017 |  "good" | ["friends", "party", "gaming"] | NaN

8/28/2017 |  "meh" |  ["work", "friends", "good food"] | "Stuff stuff"

8/29/2017 |  "bad" |  ["work", "travel"] |  "Fell off my bike"

...and so on

I'd like to transform it to this, which I think will be better for ML work:

datetime |  mood |  friends | party | gaming | work | good food | travel |  notes

8/27/2017 |  "good" | True | True | True | False | False | False | NaN

8/28/2017 |  "meh" |  True | False | False | True | True | False | "Stuff stuff"

8/29.2017 | "bad" | False | False | False | False | True | False | True | "Fell off my bike"

I've already tried the method outlined here, which just gives me a left-justified matrix of all the activities. The columns have no meaning. If I try and pass columns to the DataFrame constructor, I get an error "26 columns passed, passed data had 9 columns. I believe that's because even though I have 26 discrete events, the most I've ever done in a simultaneous day is 9. Is there a way I can have it fill with 0/False if the column isn't found in that particular row? Thanks.

3
  • Before you turn it into a dataframe (or try to), what's the structure of the data? 3 lists and a list-of-lists (for activities)? Commented Dec 18, 2018 at 2:20
  • Well I just got flamed for writing a Perl script to do this (bad me! How dare I!). Okay now censored. Anyway its very trivial, even from base coding, construct a hash (oopss dictionaries), loop around the keys (I'm okay with that word right), into which you place an "if" statement. Commented Dec 18, 2018 at 2:42
  • @tel it's a csv export from a mental health app. I've had to manually do something like this before, with another similarly formatted file. It just seems like something that either pandas or sklearn might have shortcuts for. Commented Dec 18, 2018 at 2:53

2 Answers 2

7

You can simply use get_dummies

lets assume this dataframe:

df = pd.DataFrame({'datetime':pd.date_range('2017-08-27', '2017-08-29'),
              'mood':['good','meh','bad'],'activities':[['friends','party','gaming'],
                                                        ["work", "friends", "good food"],
                                                        ["work", "travel"]],
              'notes':[np.nan, 'stuff stuff','fell off my bike']})
df.set_index(['datetime'], inplace=True)

            mood      activities                notes
datetime            
2017-08-27  good    [friends, party, gaming]    NaN
2017-08-28  meh     [work, friends, good food]  stuff stuff
2017-08-29  bad     [work, travel]              fell off my bike

just concat and get_dummies:

df2 = pd.concat([df[['mood','notes']], pd.get_dummies(df['activities'].apply(pd.Series),
                                                      prefix='activity')], axis=1)


            mood    notes   activity_friends    activity_work   activity_friends    activity_party  activity_travel activity_gaming activity_good food
datetime                                    
2017-08-27  good    NaN             1               0                 0                 1                   0                   1                   0
2017-08-28  meh     stuff stuff     0               1                 1                 0                   0                   0                   1
2017-08-29  bad    fell off my bike 0               1                 0                 0                   1                   0                   0

You change change them to booleans if you want using loc:

df2.loc[:,df2.columns[2:]] = df2.loc[:,df2.columns[2:]].astype(bool)
Sign up to request clarification or add additional context in comments.

5 Comments

i am getting error while executing above code. KeyError: "['mood' 'notes'] not in index"
What does df.columns show? Make sure it is indeed ‘mood’ and not something like ‘mood ‘
I think this is incorrect, for example if you look at the column "activity_friends", there should be a 1 for "good" and "meh".
@kdd no, it is correct; there are just two activity_friends columns. If you want them concatenated simply df2.groupby(df2.columns, axis=1).sum()
Oh, I'm blind... I swear I looked for that. Thanks!
2

Here's a complete solution, parsing of the messy output and all:

from ast import literal_eval
import numpy as np
import pandas as pd

# the raw data

d = '''datetime |  mood |  activities |  notes

8/27/2017 |  "good" | ["friends", "party", "gaming"] | NaN

8/28/2017 |  "meh" |  ["work", "friends", "good food"] | "Stuff stuff"

8/29/2017 |  "bad" |  ["work", "travel"] |  "Fell off my bike"'''

# parse the raw data
df = pd.read_csv(pd.compat.StringIO(d), sep='\s*\|\s*', engine='python')

# parse the lists of activities (which are still strings)
acts = df['activities'].apply(literal_eval)

# get the unique activities
actcols = np.unique([a for al in acts for a in al])

# assemble the desired one hot array from the activities
actarr = np.array([np.in1d(actcols, al) for al in acts])
actdf = pd.DataFrame(actarr, columns=actcols)

# stick the dataframe with the one hot array onto the main dataframe
df = pd.concat([df.drop(columns='activities'), actdf], axis=1)

# fancy print
with pd.option_context("display.max_columns", 20, 'display.width', 9999):
    print(df)

Output:

    datetime    mood               notes  friends  gaming  good food  party  travel   work
0  8/27/2017  "good"                 NaN     True    True      False   True   False  False
1  8/28/2017   "meh"       "Stuff stuff"     True   False       True  False   False   True
2  8/29/2017   "bad"  "Fell off my bike"    False   False      False  False    True   True

4 Comments

This is it. I was specifically looking for something like actrows = np.array([np.in1d(actdf.columns.values, a) for a in acts]); actdf = pd.DataFrame(actrows, columns=actdf.columns)
for some reason I get error in acts = df['activities'].apply(pd.eval), and I don't know what is al in the line actcols = np.unique([a for al in acts for a in al])
Odd. You'll probably be able to fix your issue by using the builtin eval instead of pd.eval. I actually edited the answer from one to the other since pd.eval should be a little more limited/safer while still getting the job done. If I were you, I would also check the version of my pandas package. It may be that upgrading will fix the bug (the latest version is v0.23.4)
The line that creates actcols is an example of a nested list comprehension. acts is a list-of-lists (really a pd.Series-of-lists, but whatever), al is a single sublist, and a is a single activity.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.