Reshaping into binary variables using pandas python

Question

I am still new to Python pandas' pivot_table and im trying to reshape the data to have a binary indicator if a value is in a certain observation. I have follow some previous codes and got some encouraging results, however instead of 1 and zeros as Is my ideal result I get a sum. Please see a small sample data set below

    ID          SKILL     NUM
    1             A        1
    1             A        1
    1             B        1
    2             C        1
    3             C        1
    3             C        1
    3             E        1

The results I am aiming for is:

    ID    A         B        C    E
    1     1         1        0    0
    2     0         0        1    0
    3     0         0        0    1

My code atm get the following result:

    ID    A         B        C    E
    1     2         1        0    0
    2     0         0        2    0
    3     0         0        0    1

Should I remove the duplicates first??

The code I'm using atm is below;

  df_pivot =  df2.pivot_table(index='Job_posting_ID', columns='SKILL', aggfunc=len, fill_value=0)

jezrael · Accepted Answer · 2018-07-03 08:30:13Z

2

You can use get_dummies with set_index for indicator columns and then get max values per index:

df = pd.get_dummies(df.set_index('ID')['SKILL']).max(level=0)

For better performance remove duplicates by drop_duplicates and reshape by set_index with unstack:

df = df.drop_duplicates(['ID','SKILL']).set_index(['ID','SKILL'])['NUM'].unstack(fill_value=0)

Solution with pivot, but then is necessary replace NaNs to 0:

df = df.drop_duplicates(['ID','SKILL']).pivot('ID','SKILL','NUM').fillna(0).astype(int)

If want use your solution, just remove duplicates, but better is unstack, beacuse data are never aggregated, because not duplicated pairs ID with SKILL:

df2 = df.drop_duplicates(['ID','SKILL'])
df_pivot = (df2.pivot_table(index='ID', 
                            columns='SKILL', 
                            values='NUM',
                            aggfunc=len, 
                            fill_value=0))
print (df_pivot)
SKILL  A  B  C  E
ID               
1      1  1  0  0
2      0  0  1  0
3      0  0  1  1

edited Jul 3, 2018 at 8:30

answered Jul 3, 2018 at 8:08

jezrael

868k103 gold badges1.4k silver badges1.3k bronze badges

Sign up to request clarification or add additional context in comments.

4 Comments

Ian_De_Oliveira Over a year ago

The unstack is really fast tks.. How do I keep the ID as a variable not as a index on this case ?

jezrael Over a year ago

@Ian_De_Oliveira - Last need

df.drop_duplicates(['ID','SKILL']).set_index(['ID','SKILL'])['NUM'].unstack(fill_value=0).reset_index().rename_axis(None, axis=1)

Ian_De_Oliveira Over a year ago

Tks a lot , the unstack function worked really well.. Know I need to learn how to work with big data because my computer did not like the idea to flip 55000 columns , 2500 did ok but 55000 atm only SAS is been able to.

jezrael Over a year ago

@Ian_De_Oliveira - It depends of RAM in your PC, but for working best with large data in pandas check this

zipa · Accepted Answer · 2018-07-03 07:33:39Z

1

Try like this:

df.pivot_table(index='ID', columns='SKILL', values='NUM', aggfunc=lambda x: len(x.unique()), fill_value=0)

Or this:

df.pivot_table(index='ID', columns='SKILL',aggfunc=lambda x: int(x.any()), fill_value=0)

Whichever suits you best.

answered Jul 3, 2018 at 7:33

zipa

28k6 gold badges45 silver badges62 bronze badges

Comments

jpp · Accepted Answer · 2018-07-03 08:03:27Z

1

You can use aggfunc='any' and convert to int as a separate step. This avoids having to use a lambda / custom function, and may be more efficient.

df_pivot =  df.pivot_table(index='ID', columns='SKILL',
                           aggfunc='any', fill_value=0).astype(int)

print(df_pivot)

      NUM         
SKILL   A  B  C  E
ID                
1       1  1  0  0
2       0  0  1  0
3       0  0  1  1

The same would work with aggfunc=len + conversion to int, except this is likely to be more expensive.

answered Jul 3, 2018 at 8:03

jpp

166k37 gold badges301 silver badges363 bronze badges

1 Comment

Ian_De_Oliveira Over a year ago

It worked, tks. R would not flip my table because is too big.

Collectives™ on Stack Overflow

Reshaping into binary variables using pandas python

3 Answers 3

4 Comments

Comments

1 Comment

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

4 Comments

Comments

1 Comment

Your Answer

Sign up or log in

Post as a guest

Linked

Related