3

I am still new to Python pandas' pivot_table and im trying to reshape the data to have a binary indicator if a value is in a certain observation. I have follow some previous codes and got some encouraging results, however instead of 1 and zeros as Is my ideal result I get a sum. Please see a small sample data set below

    ID          SKILL     NUM
    1             A        1
    1             A        1
    1             B        1
    2             C        1
    3             C        1
    3             C        1
    3             E        1

The results I am aiming for is:

    ID    A         B        C    E
    1     1         1        0    0
    2     0         0        1    0
    3     0         0        0    1

My code atm get the following result:

    ID    A         B        C    E
    1     2         1        0    0
    2     0         0        2    0
    3     0         0        0    1

Should I remove the duplicates first??

The code I'm using atm is below;

  df_pivot =  df2.pivot_table(index='Job_posting_ID', columns='SKILL', aggfunc=len, fill_value=0)

3 Answers 3

2

You can use get_dummies with set_index for indicator columns and then get max values per index:

df = pd.get_dummies(df.set_index('ID')['SKILL']).max(level=0)

For better performance remove duplicates by drop_duplicates and reshape by set_index with unstack:

df = df.drop_duplicates(['ID','SKILL']).set_index(['ID','SKILL'])['NUM'].unstack(fill_value=0)

Solution with pivot, but then is necessary replace NaNs to 0:

df = df.drop_duplicates(['ID','SKILL']).pivot('ID','SKILL','NUM').fillna(0).astype(int)

If want use your solution, just remove duplicates, but better is unstack, beacuse data are never aggregated, because not duplicated pairs ID with SKILL:

df2 = df.drop_duplicates(['ID','SKILL'])
df_pivot = (df2.pivot_table(index='ID', 
                            columns='SKILL', 
                            values='NUM',
                            aggfunc=len, 
                            fill_value=0))
print (df_pivot)
SKILL  A  B  C  E
ID               
1      1  1  0  0
2      0  0  1  0
3      0  0  1  1
Sign up to request clarification or add additional context in comments.

4 Comments

The unstack is really fast tks.. How do I keep the ID as a variable not as a index on this case ?
@Ian_De_Oliveira - Last need df.drop_duplicates(['ID','SKILL']).set_index(['ID','SKILL'])['NUM'].unstack(fill_value=0).reset_index().rename_axis(None, axis=1)
Tks a lot , the unstack function worked really well.. Know I need to learn how to work with big data because my computer did not like the idea to flip 55000 columns , 2500 did ok but 55000 atm only SAS is been able to.
@Ian_De_Oliveira - It depends of RAM in your PC, but for working best with large data in pandas check this
1

Try like this:

df.pivot_table(index='ID', columns='SKILL', values='NUM', aggfunc=lambda x: len(x.unique()), fill_value=0)

Or this:

df.pivot_table(index='ID', columns='SKILL',aggfunc=lambda x: int(x.any()), fill_value=0)

Whichever suits you best.

Comments

1

You can use aggfunc='any' and convert to int as a separate step. This avoids having to use a lambda / custom function, and may be more efficient.

df_pivot =  df.pivot_table(index='ID', columns='SKILL',
                           aggfunc='any', fill_value=0).astype(int)

print(df_pivot)

      NUM         
SKILL   A  B  C  E
ID                
1       1  1  0  0
2       0  0  1  0
3       0  0  1  1

The same would work with aggfunc=len + conversion to int, except this is likely to be more expensive.

1 Comment

It worked, tks. R would not flip my table because is too big.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.